1. For an overview of the categorization aspect of genres, see Zymner (2003, 99–104) [Zymner, Rüdiger. 2003. Gattungstheorie. Probleme und Positionen der Literaturwissenschaft. Paderborn: mentis.].
2. For an introduction to the background and goals of digital literary stylistics, see the website (SIG-DLS n.d.) [SIG-DLS. n.d. “Goals.” Digital Literary Stylistics (SIG-DLS). http://web.archive.org/web/20221023111813/https://dls.hypotheses.org/activities/about/about.] of the corresponding special interest group of the Alliance of Digital Humanities Organizations (ADHO).
3. See the call for papers (CLiGS n.d.) [CLiGS. n.d. “Call for Papers: Digital Stylistics in Romance Studies and beyond.” CLiGS – Computergestützte literarische Gattungsstilistik. Accessed October 23, 2022. http://web.archive.org/web/20221023113851/https://cligs.hypotheses.org/digital-stylistics-in-romance-studies-and-beyond/call-for-papers.] and the conference proceedings to be published in 2023 (Hesselbach et al., forthcoming) [Hesselbach, Robert, José Calvo Tello, Ulrike Henny-Krahmer, Christof Schöch, and Daniel Schlör, eds. Forthcoming. Digital Stylistics in Romance Studies and Beyond. Heidelberg: Heidelberg University Publishing.].
4. See, for instance, the influential studies of Jockers (2013) [Jockers, Matthew L. 2013. Macroanalysis. Digital Methods & Literary History. Topics in the Digital Humanities. Urbana, Chicago, and Springfield: University of Illinois Press.] and Underwood (2019) [Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: The University of Chicago Press.].
5. One outcome of the project is the Textbox, a collection of small to medium-sized corpora of literary texts in Romance languages of different genres, which are published on GitHub and free to reuse (Schöch, Calvo Tello et al. 2018, 2019) [Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp, eds. 2018. “The CLiGS textbox.” Version 4.0.0. Zenodo. https://doi.org/10.5281/zenodo.597430.] [Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp. 2019. “The CLiGS Textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in TEI XML.” Journal of the Text Encoding Initiative. Rolling Issue. https://doi.org/10.4000/jtei.2085.]. Beyond the Textbox, the following more extensive individual corpora resulting from the CLiGS project are worth mentioning: the “Corpus of Novels of the Spanish Silver Age” (CoNSSA, Calvo Tello 2021a) [Calvo Tello, José, ed. 2021a. “Corpus of Novels of the Spanish Silver Age (CoNSSA).” Version 1.0.0. GitHub.com. Accessed December 9, 2022. https://github.com/cligs/conssa.] and a text collection of over 800 French dramatic texts (Schöch 2017b) [Schöch, Christof, ed. 2017b. “theatreclassique.” Accessed December 9, 2022. https://github.com/cligs/theatreclassique.] derived from the corpus Théâtre Classique (Fièvre 2007–2022) [Fièvre, Paul, ed. 2007–2022. “Théâtre Classique.” Accessed December 10, 2022. https://www.theatre-classique.fr.]. The latter is also available as part of the multilingual DraCor corpus, where it is called FreDraCor (Milling, Fischer, and Göbel 2021) [Milling, Carsten, Frank Fischer, and Mathias Göbel, eds. 2021. “French Drama Corpus (FreDraCor): A TEI P5 Version of Paul Fièvre's ʻThéâtre Classiqueʼ Corpus.” GitHub.com. Accessed December 9, 2022. https://github.com/dracor-org/fredracor.].
6. For general literary histories on Spanish-American literature that also cover the nineteenth-century novel and for specialized monographs, see, among others, Alegría (1959) [Alegría, Fernando. 1959. Breve historia de la novela hispanoamericana. México: Ed. de Andrea.], Anderson Imbert (1954) [Anderson Imbert, Enrique. 1954. Historia de la literatura hispanoamericana. México: Fondo de Cultura Económica.], Dill (1999) [Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.], Gálvez (1990) [Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.], Goić (2009) [Goić, Cedomil. 2009. Brevísima relación de la historia de la novela hispanoamericana. Madrid: Biblioteca Nueva.], Íñigo Madrigal, Alvar, and Aínsa (1982) [Íñigo Madrigal, Luis, Manuel Alvar, and Fernando Aínsa, eds. 1982. Historia de la literatura hispanoamericana. 3 vols. Madrid: Cátedra.], Lindstrom (2004) [Lindstrom, Naomi. 2004. Early Spanish American Narrative. Austin: University of Texas Press.], Rössner (2007) [Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.], and Sánchez (1953) [Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.].
7. Rivas (1990) [Rivas, Mercedes. 1990. Literatura y esclavitud en la novela cubana del siglo XIX. Sevilla: Escuela de Estudios Hispano-Americanos.], for instance, establishes the concept of the anti-slavery novel based on seven different novels. Gnutzmann (1998) [Gnutzmann, Rita. 1998. La novela naturalista en Argentina (1880–1900). Amsterdam, Atlanta: Rodopi.] likewise studies the Argentine naturalistic novel with a corpus of seven texts.
8. For example, Löfquist (1995) [Löfquist, Eva. 1995. La novela histórica chilena dentro del marco de la novelística chilena. 1843–1879. Göteborg: Acta Universitatis Gothoburgensis.] on the Chilean historical novel, Read (1939) [Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.] on the Mexican historical novel, or Schlickers (2003) [Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.] on the Spanish-American naturalistic novel. Another approach is to consider the novel as a whole for an individual country and for a certain period. Lichtblau (1959) [Lichtblau, Myron I. 1959. The Argentine Novel in the Nineteenth Century. New York: Hispanic Institute in the United States.], for example, studies the nineteenth-century novel in Argentina, and Molina (2011) [Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.] the Argentine novel between 1838 and 1872.
9. There are two studies based on subparts of the corpus in which the internal structure of the texts was exploited: Schöch, Henny et al. (2016) [Schöch, Christof, Ulrike Henny, José Calvo Tello, Daniel Schlör, and Stefanie Popp. 2016. “Topic, Genre, Text. Topics im Textverlauf von Untergattungen des spanischen und hispanoamerikanischen Romans (1880–1930).” In DHd 2016. Modellierung, Vernetzung, Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts, 235–239. Leipzig: Universität Leipzig. https://doi.org/10.5281/zenodo.4645380.] on the development of topics in different parts of the novels, depending on the subgenres, and Henny-Krahmer (2018) [Henny-Krahmer, Ulrike. 2018. “Exploration of Sentiments and Genre in Spanish American Novels.” In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Mexico City, 26–29 June 2018, 399–403. Mexico City: Red de Humanidades Digitales. https://web.archive.org/web/20200702225303/https://dh2018.adho.org/exploration-of-sentiments-and-genre-in-spanish-american-novels/.] on the connection of sentiments and direct speech versus narrated text in different subgenres.
10. The web-based edition of this dissertation can be accessed at https://side17.i-d-e.de/.
11. In the meantime, for example, the dissertation of my co-doctoral student José Calvo Tello from the CLiGS project has been published (Calvo Tello 2021b) [Calvo Tello, José. 2021b. The Novel in the Spanish Silver Age. A Digital Analysis of Genre Using Machine Learning. Digital Humanities Research, vol. 4. Bielefeld: Bielefeld University Press. https://doi.org/10.14361/9783839459256.], the content of which could not be considered here because the dissertations were prepared at the same time. Due to the joint research project in which the two theses were written, there are, of course, common foundations and references between them.
12. For a general introduction to the notion of genre in literary and cultural studies, see Frow (2015) [Frow, John. 2015. Genre. The New Critical Idiom. 2nd ed. London: Routledge.]. Genres, in general, are relevant to all fields in the humanities, for example, linguistics, history, cultural studies, media studies, musicology, and art history. Introductions to genre studies from non-literary backgrounds include Lacey (2000) [Lacey, Nick. 2000. Narrative and Genre: Key Concepts in Media Studies. Basingstoke: Macmillan.] (media studies) and Bawarshi and Reiff (2010) [Bawarshi, Anis S., and Mary Jo Reiff. 2010. Genre: An Introduction to History, Theory, Research, and Pedagogy. West Lafayette: Parlor Press and the WAC Clearinghouse. https://web.archive.org/web/20230210055352/https://wac.colostate.edu/docs/books/bawarshi_reiff/genre.pdf.] (rhetoric and applied linguistics).
13. See, among others, Fubini (1971, 24–27) [Fubini, Mario. 1971. Entstehung und Geschichte der literarischen Gattungen. Tübingen: Max Niemeyer Verlag.] and García Berrio and Huerta Calvo (2009, 94) [García Berrio, Antonio, and Javier Huerta Calvo. 2009. Los géneros literarios: sistema e historia. Una introducción. 5th ed. Madrid: Cátedra.]. See also Behrens (1940) [Behrens, Irene. 1940. Die Lehre von der Einteilung der Dichtkunst. Vornehmlich vom 16. bis 19. Jahrhundert. Halle/Saale: Max Niemeyer Verlag.], who examines the history of the traditional classification of literature into lyric, epic, and drama and finds that triadic classifications as such have existed since Plato.
14. Overviews of genre theory in the twentieth century include Dubrow ([1982] 2014) [Dubrow, Heather. (1982) 2014. Genre. Reprint, London: Routledge.], Duff (2010) [Duff, David, ed. 2010. Modern Genre Theory. Harlow: Longman.], and Gymnich, Neumann, and Nünning (2007) [Gymnich, Marion, Birgit Neumann, and Ansgar Nünning, eds. 2007. Gattungstheorie und Gattungsgeschichte. Trier: WVT.]. The latter also focus on the relationship between genre theory and history. On this aspect, see also the earlier publication by Lamping (1990) [Lamping, Dieter, ed. 1990. Gattungstheorie und Gattungsgeschichte: ein Symposium. Wuppertal: Bergische Universität, Gesamthochschule Wuppertal.].
15. An early discussion of linguistic text types can be found in Gülich and Raible (1972) [Gülich, Elisabeth, and Wolfgang Raible. 1972. Textsorten. Differenzierungskriterien aus linguistischer Sicht. Frankfurt am Main: Athenäum-Verlag.]. A recent overview is given in Gansel (2011) [Gansel, Christina. 2011. Textsortenlinguistik. Göttingen: Vandenhoeck & Ruprecht.].
16. A well-known study of genre variation in English texts based on statistical methods is summarized in Biber (1993b) [Biber, Douglas. 1993b. “The Multi-Dimensional Approach to Linguistic Analyses of Genre Variation: An Overview of Methodology and Findings.” Computers and the Humanities 26 (5–6): 331–345. https://doi.org/10.1007/BF00136979.]. Another influential study of the automatic detection of text genre from computational linguistics is Kessler, Nunberg, and Schütze (1997) [Kessler, Brett, Geoffrey Nunberg, and Hinrich Schütze. 1997. “Automatic detection of text genre.” In ACL '98/EACL '98: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, 32–38. Stroudsburg, PA, USA: Association for Computational Linguistics. http://dx.doi.org/10.3115/976909.979622.].
17. For an introduction to text categorization from the perspective of natural language processing, see Manning and Schütze (1999, 575–608) [Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Mass: The MIT Press.].
18. Manning and Schütze (1999, 575) [Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Mass: The MIT Press.] use the term as a synonym for “classification”. Oakes, in contrast, differentiates the two terms: “Classification and categorization are distinct concepts. Classification is the assignment of objects to predefined classes, while categorization is the initial identification of these classes, and hence must take place before classification” (Oakes 2003, 95) [Oakes, Michael P. 2003. Statistics for corpus linguistics. Edinburgh: Edinburgh Univ. Press.]. Oakes bases his view on Thompson and Thompson (1991) [Thompson, W., and B. Thompson. 1991. “Overturning the Category Bucket.” Byte 16 (1): 249–256.].
19. For an overview of the main research areas and scope of digital literary studies, see Siemens and Schreibman (2008) [Siemens, Ray, and Susan Schreibman, eds. 2008. A Companion to Digital Literary Studies. Oxford: Blackwell.].
20. See, for example, the Journal of Computational Literary Studies (JCLS; Gius, Schöch, and Trilcke 2022–2023) [Gius, Evelyn, Christof Schöch, and Peer Trilcke, eds. 2022–2023. Journal of Computational Literary Studies (JCLS). Darmstadt: Universitäts- und Landesbibliothek Darmstadt. https://web.archive.org/web/20230210112118/https://jcls.io/.], whose first issue appeared in 2022, and the annual conference on the same topic that has been held since 2022. A proposed definition or description of the field can also be found on the website of the Kompetenzzentrum – Trier Center for Digital Humanities (2023) [Kompetenzzentrum – Trier Center for Digital Humanities. 2023. “Computational Literary Studies. A Bird's Eye View of Literature.” https://web.archive.org/web/20230210111714/https://tcdh.uni-trier.de/en/thema/computational-literary-studies.].
21. For the traditional key issues of stylometry, see Holmes (1998) [Holmes, David I. 1998. “The Evolution of Stylometry in Humanities Scholarship.” Literary and Linguistic Computing 13 (3): 111–117. https://doi.org/10.1093/llc/13.3.111.]. Stylometric studies focusing on literary genre are, for example, Binongo and Smith (1999) [Binongo, José Nilo G., and M. W. A. Smith. 1999. “A Bridge Between Statistics and Literature: The Graphs of Oscar Wilde’s Literary Genres.” Journal of Applied Statistics 26 (7): 781–787. https://doi.org/10.1080/02664769922025.] and Hettinger et al. (2015, 2016) [Hettinger, Lena, Martin Becker, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2015. “Genre Classification on German Novels.” In Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 249–253. Valencia. https://doi.org/10.1109/DEXA.2015.62.] [Hettinger, Lena, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2016. “Classification of Literary Subgenres.” In DHd2016. Modellierung – Vernetzung – Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Universität Leipzig 7. bis 12. März 2016, 160–164. Duisburg: nisaba verlag. https://doi.org/10.5281/zenodo.4645368.]. The former use linguistic features to differentiate between essays and plays written by Wilde; the latter classify various subgenres of the German novel based on stylometric, topic-based, and network-based features.
22. A clear summary of the main concerns of literary genre theory in the twentieth and twenty-first centuries and the principal theoretical positions, particularly in the German-speaking area, can be found in Zymner (2010, 213–219) [Zymner, Rüdiger, ed. 2010. Handbuch Gattungstheorie. Stuttgart: J.B. Metzler.]. The three genre theoretical issues addressed here have been selected because they were highlighted by Zymner and are considered relevant for the discussion of genre analysis and corpus design in digital stylistics.
23. With the example of the work “La folie du jour”, written by Maurice Blanchot, Derrida shows how literary texts resist their categorization in terms of genre: “The genre has always in all genres been able to play the role of order’s principle: resemblance, analogy, identity and difference, taxonomic classification, organization and genealogical tree, order of reason, order of reasons, sense of sense, truth of truth, natural light and sense of history. Now, the test of An Account? brought to light the madness of genre. Madness has given birth to and thrown light on the genre in the most dazzling, most blinding sense of the word. And in the writing of An Account?, in literature, satirically practicing all genres, imbibing them but never allowing herself to be saturated with a catalog of genres, she, madness, has started spinning Peterson’s genre-disc like a demented sun. And she does not only do so in literature, for in concealing the boundaries that sunder mode and genre, she has also inundated and divided the borders between literature and its others” (Derrida 1980, 81) [Derrida, Jacques. 1980. “The Law of Genre.” Translated by Avital Ronell. Critical Inquiry 7 (1): 55–81.].
24. “What interests me is that this re-mark—ever possible for every text, for every corpus of traces—is absolutely necessary for and constitutive of what we call art, poetry, or literature. [...] Can one identify a work of art, of whatever sort, but especially a work of discursive art, if it does not bear the mark of a genre, if it does not signal or mention it or make it remarkable in any way?” (Derrida 1980, 64) [Derrida, Jacques. 1980. “The Law of Genre.” Translated by Avital Ronell. Critical Inquiry 7 (1): 55–81.].
25. For an overview of different genre theories associated with the realistic position, see Hempfer (1973, 56–122) [Hempfer, Klaus W. 1973. Gattungstheorie. Information und Synthese. München: Fink.].
26. See, for example, Cranenburgh and Koolen (2015) [Cranenburgh, Andreas van, and Corina Koolen. 2015. “Identifying Literary Texts with Bigrams.” In Proceedings of the Fourth Workshop on Computational Linguistics for Literature, 58–67. Denver, Colorado: Association for Computational Linguistics. http://dx.doi.org/10.3115/v1/W15-0707.]. The authors analyzed the literariness of general fiction and genre fiction using machine learning based on word bigrams.
27. The former are called “Schreibweisen” by Hempfer and the latter “genres” in a narrower sense (Hempfer 1973, 27) [Hempfer, Klaus W. 1973. Gattungstheorie. Information und Synthese. München: Fink.].
28. The corpus-specific model for generic terms is presented in chapter 3.2.3 below.
29. See chapters 3.2.3 and 3.3.4 on the assignment of subgenre labels to the works in the bibliography and the corpus.
30. In his set of terms, Hempfer, for instance, also includes the term “Sammelbegriff” (“collective term”), which he uses to designate logically disjunct groups of texts established on any characteristic: “Genauso wie man ‘Kopf’, ‘Apfel’, ‘Platz’, ‘Tisch’ u.ä. mit dem Prädikator ‘rund’ belegen und somit eine Klasse von Gegenständen bilden kann, der die Eigenschaft ‘rund’ zukommt, kann man Texte aufgrund ihrer Länge, des Vorhandenseins oder Fehlens eines Erzählers, der Fiktionalität oder Nichtfiktionalität, der Tatsache, ob sie in Vers oder Prosa geschrieben sind, usw., einer bestimmten Textklasse zuordnen. Wie das Beispiel der Klassenbildung mit dem Prädikator ‘rund’ darlegen sollte, braucht eine solche Klassifizierung keineswegs aufgrund von für die dergestalt klassifizierten Objekte wesentlicher Eigenschaften zu erfolgen, und dieselben Objekte können, je nach der Eigenschaft, die man wählt, verschiedenen Klassen zugeordnet werden” (Hempfer 1973, 28) [Hempfer, Klaus W. 1973. Gattungstheorie. Information und Synthese. München: Fink.]. Schaeffer too, when discussing the problems of differentiating between theoretical and historical genres, notes that an infinite number of traits can be chosen to compare texts: “En second lieu, et inversement, le nombre de caractéristiques selon lesquelles on peut regrouper deux textes quelconques est indéfini sinon infini. Cela est dû au fait que, lorsqu’on compare deux textes, on ne part pas de leur identité numérique (toujours simple), mais de ce que Luis J. Prieto appelle leur identité spécifique (défini comme un ensemble de caractéristiques non contradictoires). Or, «comme chaque objet possède un nombre infini de caractéristiques, il peut posséder un nombre infini d’identités spécifiques; et comme n’importe quelle caractéristique que présente un objet donné peut toujours aussi faire partie des caractéristiques d’un autre objet, chaque objet peut partager n’importe laquelle de ses identités spécifiques avec un nombre infini d’autres objets»” (Schaeffer 1983, 67–68) [Schaeffer, Jean-Marie. 1983. Qu’est-ce qu’un genre littéraire? Paris: Seuil.]. The potential arbitrariness of the textual features that describe a text category is thus an important point to consider when text classification as a computational activity is used to determine literary genres.
31. He differentiates between “detective fiction”, “science fiction”, and “Gothic” (Underwood 2016, 4) [Underwood, Ted. 2016. “The Life Cycles of Genres.” Journal of Cultural Analytics 2 (2). https://doi.org/10.22148/16.005.].
32. Underwood’s method is predictive modeling with L2-regularized logistic regression based on the top 10,000 word features in the text collection (Underwood 2016, 7) [Underwood, Ted. 2016. “The Life Cycles of Genres.” Journal of Cultural Analytics 2 (2). https://doi.org/10.22148/16.005.]. The findings of his article from 2016 have been integrated into his book “Distant Horizons” (Underwood 2019, 34–67) [Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: The University of Chicago Press.].
33. The case of Underwood is an example of a clear definition and transparent documentation of the target genre convention: “To investigate these questions, I’ve gathered lists of titles assigned to a genre in eighteen different sites of reception. Some of these lists reflect recent scholarly opinion, some were defined by writers or editors earlier in the twentieth century, others reflect the practices of many different library catalogers (see Appendix A) [...] By comparing groups of texts associated with different sites of reception and segments of the timeline, we can ask exactly how stable different categories have been” (Underwood 2016, 4) [Underwood, Ted. 2016. “The Life Cycles of Genres.” Journal of Cultural Analytics 2 (2). https://doi.org/10.22148/16.005.]. In their classification of subgenres of the German novel, Hettinger et al. also explain that they analyze genre attributions made by literary scholars: “Literary scholars and common readers use labels like educational novel, crime novel or adventure novel to organize the large domain of fiction. In both discourses the use of these categories is well-established even though they are evolving and tend to be inconsistent. [...] Our corpus consists of 628 German novels mainly from the nineteenth century [...]. The novels have been manually labeled according to their subgenre after research in literary lexica and handbooks” (Hettinger et al. 2016) [Hettinger, Lena, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2016. “Classification of Literary Subgenres.” In DHd2016. Modellierung – Vernetzung – Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Universität Leipzig 7. bis 12. März 2016, 160–164. Duisburg: nisaba verlag. https://doi.org/10.5281/zenodo.4645368.]. Kim et al. also explain the provenience of the genre labels in their investigation of the prototypical emotion developments in literary genres: “We collect 2113 books from Project Gutenberg that belong to five genres found in the Brown corpus [...] namely adventure (585 books), romance (383 books), mystery (380 books), science fiction (562 books), and humorous fiction (203 books). [...] The selection is based on the Library of Congress Subject Headings in the metadata” (Kim, Padó, and Klinger 2017) [Kim, Evgeny, Sebastian Padó, and Roman Klinger. 2017. “Prototypical Emotion Developments in Literary Genres.” DH2017. Conference Abstracts. Montréal: McGill University & Université de Montréal. https://web.archive.org/web/20230211105146/https://dh2017.adho.org/abstracts/203/203.pdf.]. A case in which the provenience of the generic assignments to the texts remains implicit is Schöch (2017c) [Schöch, Christof. 2017c. “Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama.” Digital Humanities Quarterly 11 (2). https://web.archive.org/web/20230211105751/http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html.]. Schöch analyzes subgenres of French Classical and Enlightenment Drama by applying topic modeling to a corpus of plays that was initially curated by Paul Fièvre (called the “Théâtre classique” collection). The data used in the analysis is presented in detail, and the subgenres are also mentioned, but it is not made explicit where the labels that are finally used come from: “detailed metadata has been added to the texts relating, for instance, to their historical genre label (e.g. comédie héroique, tragédie, or opéra-ballet) as well as the type of thematic and regional inspiration [...]. A large part of this information can fruitfully be used when applying Topic Modeling to this text collection. [...] Finally, all the plays included belong to one of the following subgenres: comedy, tragedy or tragicomedy” (Schöch 2017c, para. 9–10). Looking into the TEI collection on GitHub (e.g., https://github.com/cligs/theatreclassique/blob/master/tei/tc0001.xml, accessed November 28, 2020), one finds a general subgenre assignment in the TEI header (<term type="genre">Comédie</term>), but no further information about its provenience. As Schöch mentions the historical subgenre labels in his text, it can be assumed that these are the source. The classification of works into dramatic subgenres is probably less debated than the classification into subgenres of the novel. Dramatic subgenre labels are also likely to be given explicitly on title pages more often than is the case for novels. Still, it would be better to make the generic convention that is analyzed more explicit because it makes a difference whether the labels used were assigned by librarians or literary historians or taken from the historical paratexts of the works.
34. An overview of proposals that have been made for corpus building in literary genre studies is given in chapter 3.3 (“Text Corpus”) below.
35. However, Jannidis himself comments on the status of the work steps: “Diese hier skizzierte Vorgehensweise ist natürlich stark idealisiert. Nicht selten steht am Anfang nicht die These, sondern ein auffälliger Befund in Texten, der dann als Indikator für eine These gedeutet wird. Doch selbst in diesem Fall einer induktiven Vorgehensweise ergibt sich zuletzt eine ähnliche Forschungsstrategie, wie hier skizziert” (Jannidis 2010, 110) [Jannidis, Fotis. 2010. “Methoden der computergestützten Textanalyse.” In Methoden der literatur- und kulturwissenschaftlichen Textanalyse, edited by Ansgar Nünning and Vera Nünning, 109–132. Stuttgart, Weimar: J.B. Metzler.].
36. For an approach to the automatic recognition of characters in German-language novels, see Jannidis et al. (2015) [Jannidis, Fotis, Markus Krug, Martin Toepfer, Frank Puppe, Isabella Reger, and Lukas Weimer. 2015. “Automatische Erkennung von Figuren in deutschsprachigen Romanen.” In DHd2015. Konferenzabstracts. https://doi.org/10.5281/zenodo.4623273.]. Barth and Viehhauser (2017) [Barth, Florian, and Gabriel Viehhauser. 2017. “Digitale Modellierung literarischen Raums.” In DHd2017. Konferenzabstracts. https://doi.org/10.5281/zenodo.4622732.] made the first attempts to formalize concepts of literary space.
37. In the context of subgenres of Spanish-American novels, see, for instance, Gnutzmann (1998) [Gnutzmann, Rita. 1998. La novela naturalista en Argentina (1880–1900). Amsterdam, Atlanta: Rodopi.] and Schlickers (2003) [Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.] on the naturalistic novel, Löfquist (1995) [Löfquist, Eva. 1995. La novela histórica chilena dentro del marco de la novelística chilena. 1843–1879. Göteborg: Acta Universitatis Gothoburgensis.] and Read (1939) [Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.] on the historical novel, and Rivas (1990) [Rivas, Mercedes. 1990. Literatura y esclavitud en la novela cubana del siglo XIX. Sevilla: Escuela de Estudios Hispano-Americanos.], Rosell (1997) [Rosell, Sara V. 1997. La novela antiesclavista en Cuba y Brasil, siglo XIX. Madrid: Ed. Pliegos.], and Sparrow de García Barrío (1977) [Sparrow de García Barrío, Constance. 1977. The abolitionist novel in nineteenth century Cuba. Baltimore: Morgan State College.] on the anti-slavery novel.
38. Suárez-Murias (1963) [Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.], for instance, is a study of the Spanish-American romantic novel by country, in which subtypes of the romantic novel are presented, such as sentimental novels, historical novels, or novels of customs. A comprehensive overview of thematic subtypes of Spanish-American novels is given in Sánchez (1953) [Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.]. In Molina’s (2011) [Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.] study of the early nineteenth-century Argentine novel, four classes of novels are established (“novela histórica”, “novela política”, “novela socializadora”, “novela sentimental”).
39. The tradition of directly contrasting genres is stronger in linguistics than in literary studies. For text linguistic contrastive genre analyses, see, for example, Adamzik (2001) [Adamzik, Kirsten, ed. 2001. Kontrastive Textologie: Untersuchungen zur deutschen und französischen Sprach- und Literaturwissenschaft. Tübingen: Stauffenburg-Verlag.], Danneberg and Niederhauser (1998) [Danneberg, Lutz, and Jürg Niederhauser, eds. 1998. Darstellungsformen der Wissenschaften im Kontrast: Aspekte der Methodik, Theorie und Empirie. Tübingen: Narr.], Gnutzmann (1990) [Gnutzmann, Claus. 1990. Kontrastive Linguistik. Frankfurt am Main: Lang.], Kaiser (2002, 2008) [Kaiser, Dorothee. 2002. Wege zum wissenschaftlichen Schreiben. Eine kontrastive Untersuchung zu studentischen Texten aus Venezuela und Deutschland. Tübingen: Stauffenburg-Verlag.] [Kaiser, Dorothee. 2008. “Ensayo o artículo científico? Una comparación de tradiciones discursivas en Alemania y Latinoamérica.” In Le style, c’est l’homme: unité et pluralité du discours scientifique dans les langues romanes, edited by Ursula Reutner, 285–304. Frankfurt am Main: Lang.], and Theisen (2016) [Theisen, Joachim. 2016. Kontrastive Linguistik. Tübingen: Narr.]. In literary studies, contrastive analyses are, in particular, used in comparative studies on different cultural and linguistic literary systems. See Lamping (2010) [Lamping, Dieter. 2010. “Komparatistische Gattungsforschung.” In Handbuch Gattungstheorie, edited by Rüdiger Zymner, 270–273. Stuttgart, Weimar: J.B. Metzler.] for an overview of comparative genre studies and Jacobs (1986) [Jacobs, Jürgen. 1986. “Bildungsroman und Pikaroroman. Versuch einer Abgrenzung.” In Der moderne deutsche Schelmenroman. Interpretationen, edited by Gerhart Hoffmeister, 9–18. Amsterdamer Beiträge zur neueren Germanistik, vol. 20. Amsterdam: Rodopi.] for a case study on the picaresque and the education novel.
40. The concept of distinctiveness or keyness, which aims to find words characteristic of one group of texts compared to another, is a general one, not limited to analyses of genre. See Burrows (2007) [Burrows, John. 2007. “All the Way Through: Testing for Authorship in Different Frequency Strata.” Literary and Linguistic Computing 22 (1): 27–47. https://doi.org/10.1093/llc/fqi067.], who developed the Zeta-measure for questions of authorship, and Scott (1997) [Scott, Mike. 1997. “PC analysis of key words — and key key words.” System 25 (2): 233–245.] for a general approach to keyword extraction.
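The idea of distinctiveness mentioned in note 40 can be illustrated with a small sketch. It uses a simplified, commonly seen variant of Zeta, namely the share of segments in a target group that contain a word minus the share of segments in a comparison group that contain it. This formulation is an assumption made for illustration and is not necessarily the exact procedure of the cited article; the function names and the toy segments are hypothetical.

```python
def segment(tokens, size):
    """Split a token list into consecutive segments of `size` tokens, dropping a short tail."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def document_proportions(segments):
    """For each word, the share of segments in which it occurs at least once."""
    counts = {}
    for seg in segments:
        for word in set(seg):
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(segments) for word, n in counts.items()}


def zeta_scores(target_segments, comparison_segments):
    """Simplified Zeta: segment proportion in the target group minus that in the comparison group.

    Scores range from -1 (typical of the comparison group only)
    to 1 (typical of the target group only).
    """
    dp_target = document_proportions(target_segments)
    dp_comparison = document_proportions(comparison_segments)
    vocabulary = set(dp_target) | set(dp_comparison)
    return {w: dp_target.get(w, 0.0) - dp_comparison.get(w, 0.0) for w in vocabulary}


# Toy segments standing in for two groups of novels (invented data).
target = [["amor", "carta", "noche"], ["amor", "noche", "casa"]]
comparison = [["guerra", "casa", "noche"], ["guerra", "rey", "casa"]]

scores = zeta_scores(target, comparison)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word])
```

Because the score depends on how consistently a word appears across segments rather than on its raw frequency, a word that is very frequent in a single text but absent from the rest of the group receives only a modest score.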
41. For an analysis of the text features that are decisive in detecting the nationality of authors of Spanish language novels, see, for instance, Zehe et al. (2018Zehe, Albin, Daniel Schlör, Ulrike Henny-Krahmer, Martin Becker, and Andreas Hotho. 2018. “A White-Box Model for Detecting Author Nationality by Linguistic Differences in Spanish Novels.” In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Mexico City, 26–29 June 2018, 519–522. Mexico City: Red de Humanidades Digitales. https://web.archive.org/web/20230212050806/https://dh2018.adho.org/en/a-white-box-model-for-detecting-author-nationality-by-linguistic-differences-in-spanish-novels/.). Sentiment features that are important in the classification of subgenres of nineteenth-century Spanish-American novels were explored in Henny-Krahmer (2018Henny-Krahmer, Ulrike. 2018. “Exploration of Sentiments and Genre in Spanish American Novels.” In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Mexico City, 26–29 June 2018, 399–403. Mexico City: Red de Humanidades Digitales. https://web.archive.org/web/20200702225303/https://dh2018.adho.org/exploration-of-sentiments-and-genre-in-spanish-american-novels/.).
42. Oakes mentions the following issues that can mask the individual authorial style and make it challenging to attribute texts correctly to a particular author: heterogeneity of authorship over time, genre, gender, variation within a single author, and topic (Oakes 2009, 1072–1073Oakes, Michael P. 2009. “Corpus Linguistics and Stylometry.” In Corpus Linguistics, edited by Anke Lüdeling and Merja Kytö, 1070–1090. Vol. 2. Berlin: De Gruyter. https://doi.org/10.1515/9783110213881.2.1070.).
43. An attempt to neutralize authorial signals in genre analysis has, for example, been made by Calvo Tello et al. (2017Calvo Tello, José, Daniel Schlör, Ulrike Henny, and Christof Schöch. 2017. “Neutralizing the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels.” In Digital Humanities 2017. Conference Abstracts, Montréal, Canada, August 8–11, 2017, 181–184. Montreal: McGill University & Université de Montréal. https://web.archive.org/web/20230212053238/https://dh2017.adho.org/abstracts/037/037.pdf.).
44. The sum of the numbers related to currents is higher than 172 because the same work can be associated with several literary currents. The bibliographic data containing this information is available at https://raw.githubusercontent.com/cligs/bibacme/master/app/data/works.xml, and the script used to retrieve the numbers of sentimental novels can be viewed at https://github.com/cligs/scripts-nh/blob/master/concepts/subgenre-label-combinations.xsl. Both links accessed November 29, 2020.
45. In a presentation about the differentiation of authorship, form, and genre of literary texts, Schöch and Pielström tried to identify statistical components that are clearly attributable to either of these factors. Analyzing French dramatic texts, they found two components that primarily covered differences in authorship but none that were predominantly related to genre (Schöch and Pielström 2014aSchöch, Christof, and Steffen Pielström. 2014a. “Für eine computergestützte literarische Gattungsstilistik.” DHd2014. Konferenzabstracts. Passau: Universität Passau. https://doi.org/10.5281/zenodo.4623620., 2014bSchöch, Christof, and Steffen Pielström. 2014b. “Die Principal Component Analysis für die Differenzierung von Autorschaft, Form und Gattung literarischer Texte.” Talk presented at the <philtag n=12/>, University of Würzburg, September 19, 2014.).
46. Jannidis, Konle, and Leinen (2019Jannidis, Fotis, Leonard Konle, and Peter Leinen. 2019. “Makroanalytische Untersuchung von Heftromanen.” In DHd2019. Digital Humanities: multimedial & multimodal. Konferenzabstracts, Universitäten zu Mainz und Frankfurt, 25. bis 29. März 2019, edited by Patrick Sahle, 167–173. Frankfurt & Mainz: Verband Digital Humanities im deutschsprachigen Raum e.V. https://doi.org/10.5281/zenodo.4622093.), for example, analyzed a corpus of 9,000 German-language dime novels published between 2009 and 2017. They aimed to find out how the subgenres of the dime novels can be differentiated and how the corpus as a whole differs from high-prestige novels. They used the 8,000 most frequent nouns as features and classified the novels with Logistic Regression. In addition, clustering was performed on the basis of the 2,000 MFW. To analyze the relevant features for the different subgenres, topic modeling and a contrastive analysis with the Zeta measure were performed. They found that the subgenres can be distinguished well on both a stylistic and a thematic level. Regarding the complexity of dime novels compared to high-prestige literature, they found that sentences are shorter in dime novels. However, they did not find any clear differences in vocabulary richness or word length.
47. In Todorov’s study, the term “register” is probably meant in the linguistic sense of a functionally determined specific way of writing or speaking (for example, formal versus informal) and as a term that is related to that of style.
48. How well the horizons of expectations can be captured through the analysis of historical documents is a point of debate (Voßkamp 1977, 29Voßkamp, Wilhelm. 1977. “Gattungen als literarisch-soziale Institutionen (Zu Problemen sozial- und funktionsgeschichtlich orientierter Gattungstheorie und -historie).” In Textsortenlehre – Gattungsgeschichte, edited by Walter Hinck, 27–44. Heidelberg: Quelle & Meyer.).
49. Another approach that is strongly dependent on a historical anchoring is Voßkamp’s concept of genres as institutions, which are determined by their social and functional history. Voßkamp understands genres as selections among several possible alternatives and argues that prototypical works play an important role in institutionalizing generic conventions and forming generic norms. According to the institutional theory of genres, an important task is to analyze the literary- and socio-historical context and the conditions of the prototypical works’ creation, to understand what differentiates them from the alternatives that existed and what their specific social and historical functions were (Voßkamp 1977, 30–31Voßkamp, Wilhelm. 1977. “Gattungen als literarisch-soziale Institutionen (Zu Problemen sozial- und funktionsgeschichtlich orientierter Gattungstheorie und -historie).” In Textsortenlehre – Gattungsgeschichte, edited by Walter Hinck, 27–44. Heidelberg: Quelle & Meyer.).
50. See footnote 33 above for examples.
51. The term “macroanalysis” was coined by Jockers: “The approach to the study of literature that I am calling ‘macroanalysis’ is in some general ways akin to economics or, more specifically, to macroeconomics. [...] There was, however, ‘microeconomics,’ which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to our study of individual texts via ‘close readings.’ Macroeconomics, however, is about the study of the entire economy. It tends towards enumeration and quantification and is in this sense similar to bibliographic studies, biographical studies, literary history, philology, and the enumerative, quantitative analysis of text that is the foundation of computing in the humanities” (Jockers 2013, 24Jockers, Matthew L. 2013. Macroanalysis. Digital Methods & Literary History. Topics in the Digital Humanities. Urbana, Chicago, and Springfield: University of Illinois Press.). When listing the opportunities of the macroanalytic approach, Jockers mentions several points that are linked to historical contextualization and change: “This approach offers specific insights into literary-historical questions, including insights into: the historical place of individual texts, authors, and genres in relation to a larger literary context; literary production in terms of growth and decline over time or within regions or within demographic groups; literary patterns and lexicons employed over time, across periods, within regions, or within demographic groups; the cultural and societal forces that impact literary style and the evolution of style; the cultural, historical, and societal linkages that bind or do not bind individual authors, texts, and genres into an aggregate literary culture; the waxing and waning of literary themes; the tastes and preferences of the literary establishment and whether those preferences correspond to general tastes and preferences” (Jockers 2013, 24Jockers, Matthew L. 2013. Macroanalysis. Digital Methods & Literary History. Topics in the Digital Humanities. Urbana, Chicago, and Springfield: University of Illinois Press.). Jockers analyzes novels in English based on metadata and full texts, and he repeatedly points out how they develop historically. For example, he trains models for specific decades and analyzes which texts from other decades are stylistically similar to the initial ones, finding approximately thirty-year generations of style. He also finds correlations between the publication dates of novels and their subgenres and subsequently traces the signals of genre style throughout the nineteenth century (Jockers 2013, 82–89Jockers, Matthew L. 2013. Macroanalysis. Digital Methods & Literary History. Topics in the Digital Humanities. Urbana, Chicago, and Springfield: University of Illinois Press.).
52. An example of such a constellation is the German novella, which Schröter described as an instance of “historically discontinuous and heterogeneous genres” (Schröter 2019, 227Schröter, Julian. 2019. “Gattungsgeschichte und ihr Gattungsbegriff am Beispiel der Novellen.” Journal of Literary Theory 13 (2): 227–257.). Rigorous poetics of the novella existed, but these were not consistent with the texts called novellas. Furthermore, the texts carrying the label “novella” could not be described as a homogeneous group (Schröter 2019, 229Schröter, Julian. 2019. “Gattungsgeschichte und ihr Gattungsbegriff am Beispiel der Novellen.” Journal of Literary Theory 13 (2): 227–257.). Schröter proposes to define “intersections” between the texts referred to as novellas and classificatory sets of texts so that, for example, texts with the name novella that were published as fictional journal prose and share the characteristics of the latter type of text would form one subtype of the genre novella. Schröter states: “It is important not to summarily define such intersections as novellas, as is often done. Such a definition would result in a classificatory concept of the respective genre, which in turn would no longer be suitable for comprehending the historical use of the generic label and hence the semantics of a genre in literary-historical communication.” (Schröter 2019, 228Schröter, Julian. 2019. “Gattungsgeschichte und ihr Gattungsbegriff am Beispiel der Novellen.” Journal of Literary Theory 13 (2): 227–257.). Using the terminology proposed here, the intersection of texts referred to as novellas and fictional journal prose would be called a “textual literary genre”, whereas the semantics of the novella in literary-historical communication would be referred to as the “conventional literary genre”. The novella, as a conventional literary genre, would then consist of several different textual genres.
53. For a short description of the usual types of generic signals, see Fricke (1981, 135Fricke, Harald. 1981. Norm und Abweichung. Eine Philosophie der Literatur. München: Beck.). A more comprehensive overview is given in Fowler (1982, 88–105Fowler, Alastair. 1982. Kinds of Literature. An Introduction to the Theory of Genres and Modes. Oxford: Clarendon Press.).
54. A Mexican writer who reflected on the novel’s role in Mexican national literature was, for example, Ignacio Manuel Altamirano, who expressed his ideas in the essay “Revistas Literarias de México” (Altamirano 1868Altamirano, Ignacio Manuel. 1868. Revistas literarias de México. México: T. F. Neve.).
55. The terms “collectif” versus “individuel” and “unique” versus “multiple” are introduced by Schaeffer when discussing the status of different kinds of generic terms. Furthermore, Schaeffer distinguishes between “noms génériques endogènes”, which are used by authors or their public, and “noms génériques exogènes”, which are established by literary historians. The generic terms can have a textual status (attached to the literary text, for example, as a paratextual element) or a meta-textual status (if they are used as terms to discuss a work but are external to it), and the functions of the labels vary depending on the category they belong to (Schaeffer 1983, 65, 77–78Schaeffer, Jean-Marie. 1983. Qu’est-ce qu’un genre littéraire? Paris: Seuil.).
56. Rather than on the question of historical convention and institutionalization, Schaeffer focuses on the place and concept of the genre names in the communicative situation. He uses the property of all texts as speech acts as an argument for characterizing genres as more analytical or historical: “Je pense qu’il faut aller plus loin: les genres théoriques, c’est-à-dire en fait les genres tels qu’ils sont définis par tel ou tel critique, font eux-mêmes partie de ce qu’on pourrait appeler la logique pragmatique de la généricité, logique qui est indistinctement un phénomène de production et de réception textuelle. En ce sens on peut dire que l’Introduction à la littérature fantastique de Todorov est elle-même un des facteurs de la dynamique générique, à savoir une proposition spécifique pour un regroupement textuel spécifique et donc pour un modèle générique spécifique [...]” (Schaeffer 1983, 68Schaeffer, Jean-Marie. 1983. Qu’est-ce qu’un genre littéraire? Paris: Seuil.).
57. Schaeffer, too, argues that there should not be a strict separation of theoretical and historical genres and genre labels, although he recognizes that they follow quite different rules: “le système des genres théoriques, construit à partir d’oppositions différentielles, simples ou multiples, obéit à des contraintes de cohérence qui ne sont pas celles des genres historiques (quelle que soit la réalité de ces genres désignés par les noms de genres traditionnels)” (Schaeffer 1983, 67Schaeffer, Jean-Marie. 1983. Qu’est-ce qu’un genre littéraire? Paris: Seuil.). Schaeffer demonstrates why it is not useful to assume a direct deductive relationship between both. However, if theoretical definitions of genres are used as hypotheses about textual genres, no claim about the integrity of the conventional historical genres is made, so that such a deductive procedure seems viable.
58. For definitions of the historical novel in which the mentioned characteristics play a role, see, among others, Fernández Prieto (1996Fernández Prieto, Celia. 1996. “Poética de la novela histórica como género literario.” Signa. Revista de la Asociación Española de Semiótica 5: 185–202. https://www.cervantesvirtual.com/nd/ark:/59851/bmc7p9c7.), Lefere (2013Lefere, Robin. 2013. La novela histórica: (re)definición, caracterización, tipología. Madrid: Visor Libros.), Lukács (1955Lukács, Georg. 1955. Der Historische Roman. Berlin: Aufbau-Verlag.), Maxwell (2009Maxwell, Richard. 2009. The Historical Novel in Europe, 1650–1950. Cambridge: Cambridge University Press.) and Spang (1998Spang, Kurt. 1998. “Apuntes para una definición de la novela histórica.” In La novela histórica. Teoría y comentarios., edited by Kurt Spang, Ignacio Arellano, and Carlos Mata, 63–125. 2nd ed. Pamplona: EUNSA. https://web.archive.org/web/20160504022949/http://www.culturahistorica.es/spang/novela_historica.pdf.).
59. In this case, it would be important not to derive the labels directly from the theoretical definition of the textual genre in order to avoid circular reasoning.
60. The colors were added by me.
61. He thus refers to textual coherence in a broad sense involving all the discursive levels of a speech act. The difference between factual or fictional utterances, for instance, would in fact be difficult to pin down to textual features in a narrower sense.
62. See, in particular, chapter 3.2.3.6, where the empirically based discursive model of generic terms is described.
63. This can be confirmed by the explicit genre labels found in the digital bibliography of nineteenth-century Argentine, Cuban, and Mexican novels created for the present study because the label “novela naturalista” is found in the subtitles of three historical editions (“¿Inocentes o culpables? Novela naturalista” (1884, AR) by Juan Antonio Argerich, “Los bandidos de Río Frío. Novela naturalista, humorística, de costumbres, de crímenes y de horrores” (1892, MX) by Manuel Payno, and “Conventillo de intelectuales. Novela de índole rebelde y de género naturalista que no deben leer las almas timoratas” (1904, AR) by Francisco Guillo). It is also referred to in the prefaces of two of the naturalistic novels whose full text was examined (“El tipo más original” (1879, AR) by Eduardo Ladislao Holmberg and “Perfiles y medallones” (1886, AR) by Silverio Domínguez).
64. Hempfer (1973, 14–29Hempfer, Klaus W. 1973. Gattungstheorie. Information und Synthese. München: Fink.) dedicates a whole chapter to the discussion of terminological problems. Although his study was first published in the seventies, it is still influential today in the German-speaking area. Klausnitzer and Naschert discuss his approach as one of the positions in genre theory which sparked some debate in the twentieth century (Klausnitzer and Naschert 2007, 387–404Klausnitzer, Ralf, and Guido Naschert. 2007. “Gattungstheoretische Kontroversen? Konstellationen der Diskussion von Textordnungen im 20. Jahrhundert.” In Kontroversen in der Literaturtheorie – Literaturtheorie in der Kontroverse, edited by Ralf Klausnitzer and Carlos Spoerhase, 369–412. Bern: Peter Lang.). Neumann and Nünning (2007Neumann, Birgit, and Ansgar Nünning. 2007. “Einleitung: Probleme, Aufgaben und Perspektiven der Gattungstheorie und Gattungsgeschichte.” In Gattungstheorie und Gattungsgeschichte, edited by Marion Gymnich, Birgit Neumann, and Ansgar Nünning, 1–28. Trier: WVT.) also refer to him repeatedly in their overview of problems, tasks, and perspectives of genre theory and history.
65. For instance: “Kinds may in this way give the impression of being fixed, definite things, located in history, whose description is a fairly routine matter. As we shall see, there is something in the idea of definiteness. But describing even a familiar kind is no simple matter. We may think we know what a sonnet is, until we look into the Elizabethan sonnet and are faced with quatorzain stanzas, fourteen-line epigrams, sixteen-line sonnets, and ‘sonnet sequences’ mixing sonnets with complaints or Anacreontic odes. Besides such historical changes within individual kinds, there are wider changes in the literary model allowed for, with their repercussions on the significance and even categorization of generic features” (Fowler 1982, 57Fowler, Alastair. 1982. Kinds of Literature. An Introduction to the Theory of Genres and Modes. Oxford: Clarendon Press.).
66. A difference between Hempfer’s Schreibweisen and Fowler’s modes is that Fowler does not see the modes as ahistorical constants. He describes them as distillations of kinds, i.e., of the features of kinds that seem permanently valuable. In that respect, they are more durable than the historical kinds because they are not linked to external forms that become outdated faster. However, these distillations may also change or become obsolete, as, for example, the heroic mode, which is only conserved in historical or political novels (Fowler 1982, 111Fowler, Alastair. 1982. Kinds of Literature. An Introduction to the Theory of Genres and Modes. Oxford: Clarendon Press.).
67. In the area of bibliographic modeling, this question is, for example, addressed in the conceptual model FRBR of the International Federation of Library Associations and Institutions (IFLA): “variant texts incorporating revisions or updates to an earlier text are viewed simply as expressions of the same work [...]. Similarly, abridgements or enlargements of an existing text [...] are considered to be different expressions of the same work” (International Federation of Library Associations and Institutions (IFLA) 2009, 17International Federation of Library Associations and Institutions (IFLA). 2009. Functional Requirements for Bibliographic Records. Final report. https://repository.ifla.org/handle/123456789/811.). Translations are also considered as different forms of the same work. On the other hand, “when the modification of the work involves a significant degree of independent intellectual or artistic effort, the result is viewed [...] as a new work” (International Federation of Library Associations and Institutions (IFLA) 2009, 17International Federation of Library Associations and Institutions (IFLA). 2009. Functional Requirements for Bibliographic Records. Final report. https://repository.ifla.org/handle/123456789/811.). This includes, for example, paraphrases, summaries, adaptations for children, parodies, and changes from one art form to another. As can be seen, in individual cases, it may be difficult to decide whether changes to a text are a minor revision or a significant modification. Certainly, the question of the definition and unity of a literary work is also central in literary studies. For example, the role of authorship or the prerequisite of completion for a work to be a work can be questioned. For an overview, see Thomé (2007Thomé, Horst. 2007. “Werk.” In Reallexikon der Deutschen Literaturwissenschaft. Neubearbeitung des Reallexikons der deutschen Literaturgeschichte, edited by Klaus Weimar, Harald Fricke, and Jan-Dirk Müller, 832–834. Vol. 2. Berlin, New York: De Gruyter.).
68. See the mentions of the novel in Dill (1999Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.), Fernández-Arias Campoamor (1952Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.), Gálvez (1990Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.), Read (1939Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.), Sánchez (1953Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.), and Varela Jácome ([1982] 2000Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.).
69. Fowler not only sets forth transformations of genre but also differentiates them from modal transformations, which he calls “generic modulation”, with a different sense than the modulation that Schaeffer describes.
70. The CLiGS group also experimented with this algorithm to detect phases of accelerated literary development in over 300 French twentieth-century novels in a project presented at the conference “Forum Junge Romanistik”, using topics and temporal expressions as features (Schöch et al. 2017Schöch, Christof, Ulrike Henny, José Calvo Tello, Katrin Betz, and Daniel Schlör. 2017. “Epochenschwellen als Phasen beschleunigter literarischer Entwicklung?” Talk presented at the Forum Junge Romanistik, Göttingen, March, 2017. Accessed March 3, 2023. https://christofs.github.io/fjr17/#.).
71. Gymnich and Neumann synthesize the references between the four levels in a diagram: they interpret the individual-cognitive level as mediating between the textual level and the cultural-historical dimension of genres and describe the functional aspects as superordinate to the other three levels. Their model aims not to provide a homogenizing general theory of genre but an integrative view on the different theoretical approaches to it so that scholarly communication about genres is facilitated (Gymnich and Neumann 2007, 34–35Gymnich, Marion, and Birgit Neumann. 2007. “Vorschläge für eine Relationierung verschiedener Aspekte und Dimensionen des Gattungskonzepts: Der Kompaktbegriff Gattung.” In Gattungstheorie und Gattungsgeschichte, edited by Marion Gymnich, Birgit Neumann, and Ansgar Nünning, 31–52. Trier: WVT.).
72. Bonheim’s method is based on the idea of separating different classes of literary genres by finding necessary (“megafeatures”) and optional elements (“microfeatures”) for each class. An aspect of his model that is of interest from the point of view of digital genre stylistics is that “loss features” are also considered, i.e., features that are negated or missing in certain genres (Bonheim 1992, 2–3Bonheim, Helmut. 1992. “The Cladistic Method of Classifying Genres.” Yearbook of Research in English and American Literature (REAL) 8: 1–32.).
73. “der gleichsam verwickelten Welt der Literatur”.
74. For example, by Hempfer (2014, 416–419Hempfer, Klaus W. 2014. “Some Aspects of a Theory of Genre.” In Linguistics and Literary Studies/Linguistik und Literaturwissenschaft. Interfaces, Encounters, Transfers/Begegnungen, Interferenzen und Kooperationen, edited by Monika Fludernik and Daniel Jacob, 405–422. Berlin: De Gruyter.), who proposes to use the concept of family resemblance instead to conceptualize historical genres, such as the elegy, or Tophinke (1997, 161–163Tophinke, Doris. 1997. “Zum Problem der Gattungsgrenze – Möglichkeiten einer prototypentheoretischen Lösung.” In Gattungen mittelalterlicher Schriftlichkeit, edited by Barbara Frank, Thomas Haye, and Doris Tophinke, 161–182. Tübingen: Narr.), who suggests a solution based on prototype theory for official municipal charters and the unofficial ones used by merchants in the Late Middle Ages.
75. Examples of classificatory genre stylistic studies are Calvo Tello (2018Calvo Tello, José. 2018. “Genre Classification in Novels: A Hard Task for Humans and Machines?” In EADH 2018: Data in Digital Humanities. Conference Abstracts. Galway: National University of Ireland. https://web.archive.org/web/20230304103733/https://eadh2018.exordo.com/files/papers/46/final_draft/20181205_genre_classification_human_vs_machines.pdf.), Gianitsos et al. (2019Gianitsos, Efthimios Tim, Thomas J. Bolt, Pramit Chaudhuri, and Joseph P. Dexter. 2019. “Stylometric Classification of Ancient Greek Literary Texts by Genre.” In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Minneapolis, MN, USA, June 7, 2019, 52–60. Minneapolis: Association for Computational Linguistics. http://dx.doi.org/10.18653/v1/W19-2507.), Henny-Krahmer (2018Henny-Krahmer, Ulrike. 2018. “Exploration of Sentiments and Genre in Spanish American Novels.” In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Mexico City, 26–29 June 2018, 399–403. Mexico City: Red de Humanidades Digitales. https://web.archive.org/web/20200702225303/https://dh2018.adho.org/exploration-of-sentiments-and-genre-in-spanish-american-novels/.), Hettinger et al. (2016Hettinger, Lena, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2016. “Classification of Literary Subgenres.” In DHd2016. Modellierung – Vernetzung – Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Universität Leipzig 7. bis 12. März 2016, 160–164. Duisburg: nisaba verlag. https://doi.org/10.5281/zenodo.4645368.), Schöch (2017bSchöch, Christof, ed. 2017b. “theatreclassique.” Accessed December 9, 2022. https://github.com/cligs/theatreclassique.), Schöch, Henny et al. (2016Schöch, Christof, Ulrike Henny, José Calvo Tello, Daniel Schlör, and Stefanie Popp. 2016. “Topic, Genre, Text. Topics im Textverlauf von Untergattungen des spanischen und hispanoamerikanischen Romans (1880–1930).” In DHd 2016. Modellierung, Vernetzung, Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts, 235–239. Leipzig: Universität Leipzig. https://doi.org/10.5281/zenodo.4645380.), and Underwood (2015bUnderwood, Ted. 2015b. Understanding Genre in a Collection of a Million Volumes. White Paper Report. Urbana-Champaign: University of Illinois. http://dx.doi.org/10.17613/M6W07V.). A special focus on the separation of genre from author signals can be found in Calvo Tello et al. (2017Calvo Tello, José, Daniel Schlör, Ulrike Henny, and Christof Schöch. 2017. “Neutralizing the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels.” In Digital Humanities 2017. Conference Abstracts, Montréal, Canada, August 8–11, 2017, 181–184. Montreal: McGill University & Université de Montréal. https://web.archive.org/web/20230212053238/https://dh2017.adho.org/abstracts/037/037.pdf.) and Schöch (2013Schöch, Christof. 2013. “Fine-tuning Stylometric Tools: Investigating Authorship and Genre in French Classical Theater.” In Digital Humanities 2013. Conference Abstracts, Lincoln, NE, USA, July 16–19, 383–386. Lincoln, NE, USA: University of Nebraska-Lincoln. https://web.archive.org/web/20230304104934/http://dh2013.unl.edu/abstracts/ab-270.html.).
76. Classification as a supervised method of machine learning is introduced in more detail in the analysis part of this study. See chapter 4.2.2.1.
77. Schaeffer defines hypertextual relationships as follows: “J’accepte comme relation générique hypertextuelle toute filiation plausible qu’on peut établir entre un texte et un ou plusieurs ensembles textuels antérieurs ou contemporains dont, sur la foi de traits textuels ou d’index divers, il semble licite de postuler qu’ils ont fonctionné comme modèles génériques lors de la confection du texte en question, soit qu’il les imite, soit qu’il s’en écarte, soit qu’il les mélange, soit qu’il les inverse, etc.” (Schaeffer 1983, 174Schaeffer, Jean-Marie. 1983. Qu’est-ce qu’un genre littéraire? Paris: Seuil.).
78. See, for instance, Schnur-Wellpott’s (1983, 149–159Schnur-Wellpott, Margrit. 1983. Aporien der Gattungstheorie aus semiotischer Sicht. Tübingen: Narr.) exposition of the two perspectives of a founding text and its followers versus a master and his predecessors.
79. See, for example, the availability of scores and probabilities for Support Vector Machines (SVM) or Random Forest Classifiers (RF) (Scikit-learn developers 2007–2023mScikit-learn developers. 2007–2023m. “Support Vector Machines, sec. Scores and probabilities.” Scikit-learn. https://web.archive.org/web/20230304123130/https://scikit-learn.org/stable/modules/svm.html., 2007–2023eScikit-learn developers. 2007–2023e. “sklearn.ensemble.RandomForestClassifier.” Scikit-learn. https://web.archive.org/web/20230304130404/https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.).
80. Unlike Taylor, who uses the term “attributes” for non-classical categorization, here the term “features” is used because it is the term that is usually employed in digital text analysis.
81. Another approach to using statistical classification for analyzing the prototypicality of individual texts with regard to genres is pursued by Konle in his master’s thesis about word embeddings for literary texts. Using 200-word segments of novels, Konle trained a neural network aiming to find words that are distinctive for different genres. The network is trained for the subgenres of sentimental, crime, science-fiction, and horror novels and is used to classify the segments by genre. When discussing the results, Konle analyzes individual segments that were misclassified. He finds that there are cases in which the misclassifications make sense because the novels are not prototypical for their genres in all parts and contain passages with words that are distinctive for other genres. These can be elements of the plot that are interpreted in terms of another subgenre, for example, passages about the death of characters in sentimental novels that are classified as horror novel segments (Konle 2019, 46–49, 59–64, 69–76Konle, Leonard. 2019. “Word Embeddings für literarische Texte.” Master’s thesis, Würzburg: Julius-Maximilians-Universität Würzburg. https://web.archive.org/web/20230305090725/https://lekonard.github.io/blog/Konle_Thesis.pdf.).
82. For different implementations of clustering algorithms in Python, see Scikit-learn developers (2007–2023aScikit-learn developers. 2007–2023a. “Clustering.” Scikit-learn. https://web.archive.org/web/20230304125710/https://scikit-learn.org/stable/modules/clustering.html.). Practice-oriented introductions to clustering are Müller and Guido (2016, 170–209Müller, Andreas C., and Sarah Guido. 2016. Introduction to Machine Learning with Python: a Guide for Data Scientists. Sebastopol, CA: O’Reilly. and VanderPlas (2017, 462–476VanderPlas, Jake. 2017. Python Data Science Handbook. Essential Tools for Working with Data. 2nd ed. Sebastopol, CA: O’Reilly.). A more theoretically oriented introduction to clustering algorithms is contained in Alpaydin (2016, 143–162Alpaydin, Ethem. 2016. Machine Learning: The New AI. Cambridge, Mass.: The MIT Press.).
83. This becomes especially clear in hierarchical clustering, an approach that divides the data into several levels of clusters, either in a top-down or bottom-up way. The latter is called agglomerative clustering and starts from merging the two most similar points, which again are merged to the next most similar cluster, and so on, either until a certain number of clusters is reached or until all points are merged into the same overall cluster (Müller and Guido 2016, 184Müller, Andreas C., and Sarah Guido. 2016. Introduction to Machine Learning with Python: a Guide for Data Scientists. Sebastopol, CA: O’Reilly.). Hierarchical clustering is often used in stylometric analyses that are concerned with authorship attribution, so that it can be inspected on which close or distant level different authorial candidates are grouped together with the text in question (Eder 2017Eder, Maciej. 2017. “Visualization in stylometry: Cluster analysis using networks.” Digital Scholarship in the Humanities 32 (1): 50–64. https://doi.org/10.1093/llc/fqv061.).
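The bottom-up procedure described in note 83 can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the procedure used in the studies cited: the function names `average_linkage` and `agglomerate`, the choice of average linkage with Euclidean distance, and the toy points are all hypothetical and chosen for the example.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters of points."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the final clusters and the merge history (one entry per level).
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history


# Hypothetical toy data: five texts represented by two features each.
texts = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.0), (5.0, 5.0), (5.3, 5.0)]
clusters, history = agglomerate(texts, n_clusters=2)
```

With the default `n_clusters=1`, merging continues until a single cluster remains. The returned `history` lists the merges in order, which is the information a dendrogram displays. For actual analyses, a library implementation such as scikit-learn's `AgglomerativeClustering` is likely the better choice.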
84. This is the case with K-means, spectral clustering, Ward hierarchical clustering, or agglomerative clustering (Scikit-learn developers 2007–2023aScikit-learn developers. 2007–2023a. “Clustering.” Scikit-learn. https://web.archive.org/web/20230304125710/https://scikit-learn.org/stable/modules/clustering.html.).
85. An early critique was formulated by Vivas as early as 1968. Vivas discusses several aesthetic, social, and philosophical reasons for the unpopularity of the idea of genres as classes. He criticizes the family resemblance notion as a loophole to avoid the question of the genre’s nominalistic or realistic status: “Taken seriously, nominalism involves the notion that structures have no status in being whatever. But how a totally invertebrate world is possible I have never been able to understand. ‘There is,’ you may say, ‘a new solution to this old problem.’ ‘Yes, I know,’ I reply, ‘there is a newfangled one: It is the evangel of Saint Ludwig.’ According to these glad tidings the members of a class share among themselves, not identities but family resemblances. Obviously I cannot stop to analyze this newfangled solution here. Let me merely lay it down that between two members of a family the resemblance is that of shared identity. We are therefore not farther along than we were before” (Vivas 1968, 101Vivas, Eliseo. 1968. “Literary Classes: Some Problems.” Genre 1: 97–105.). Nevertheless, Vivas defends the idea of genres as open concepts.
86. Hempfer, in turn, objects that the common features that Fishelov finds may be necessary but not sufficient (Hempfer 2014, 409–410Hempfer, Klaus W. 2014. “Some Aspects of a Theory of Genre.” In Linguistics and Literary Studies/Linguistik und Literaturwissenschaft. Interfaces, Encounters, Transfers/Begegnungen, Interferenzen und Kooperationen, edited by Monika Fludernik and Daniel Jacob, 405–422. Berlin: De Gruyter.).
87. As an example of the application of the family resemblance concept, Hempfer describes the history of the elegy, a genre that was originally only identifiable metrically and later by a number of other traits, among them intertextual references and motifs (Hempfer 2014, 416–417Hempfer, Klaus W. 2014. “Some Aspects of a Theory of Genre.” In Linguistics and Literary Studies/Linguistik und Literaturwissenschaft. Interfaces, Encounters, Transfers/Begegnungen, Interferenzen und Kooperationen, edited by Monika Fludernik and Daniel Jacob, 405–422. Berlin: De Gruyter.). Hempfer concludes: “The diachrony of the genre can best be represented as a synchronic network of relations, in which each individual text or epochal version of the genre is linked to other historical versions through common features. [...] The genre identity, then, is not produced by a single trait but by the entirety of all relations among their historical versions” (Hempfer 2014, 419Hempfer, Klaus W. 2014. “Some Aspects of a Theory of Genre.” In Linguistics and Literary Studies/Linguistik und Literaturwissenschaft. Interfaces, Encounters, Transfers/Begegnungen, Interferenzen und Kooperationen, edited by Monika Fludernik and Daniel Jacob, 405–422. Berlin: De Gruyter.). For an application of the family resemblance concept to genre theory, see also Strube, who interprets a definition of the novella set up by Seidler in that way (Strube 1993, 21–25Strube, Werner. 1993. Analytische Philosophie der Literaturwissenschaft. Untersuchungen zur literaturwissenschaftlichen Definition, Klassifikation, Interpretation und Textbewertung. Paderborn: Schöningh.).
88. Principal Component Analysis (PCA) is a technique for dimensionality reduction that projects the data points onto so-called “principal components”, which aim to preserve as much variation of the data as possible. The number of dimensions that the data has can be reduced by only considering the resulting principal components further. In digital genre stylistics, it has, for example, been used by Schöch to visualize how French classical tragedies, comedies, and tragicomedies distribute over principal components based on topic features (Schöch 2017c, paras 33–41Schöch, Christof. 2017c. “Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama.” Digital Humanities Quarterly 11 (2). https://web.archive.org/web/20230211105751/http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html.). Schöch groups the PCA analysis under the heading “Clustering”, as does Oakes in his general introduction to statistics for corpus linguistics, because the data is grouped based on similarity (or distance) relationships. Oakes also uses the term “categorization” for clustering methods (Oakes 2003, 95Oakes, Michael P. 2003. Statistics for corpus linguistics. Edinburgh: Edinburgh Univ. Press.). Here, clustering and classification are considered categorization methods (in the general sense of category building). PCA, however, is not considered one, because the data points as a whole are not assigned to separate text categories. A related method is Factor Analysis, which Biber used to find groups of feature distributions that serve as the basis for defining functional text types (Biber 1993bBiber, Douglas. 1993b. “The Multi-Dimensional Approach to Linguistic Analyses of Genre Variation: An Overview of Methodology and Findings.” Computers in the Humanities 26 (5–6): 331–345. https://doi.org/10.1007/BF00136979.).
89. See, for instance, Underwood’s approach to genre via the history of reception or Schröter’s proposal to apply machine learning methods to reconstruct the historical change of disordered genres such as the German Novelle (Underwood 2016Underwood, Ted. 2016. “The Life Cycles of Genres.” Journal of Cultural Analytics 2 (2). https://doi.org/10.22148/16.005.; Underwood 2019, 34–67Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: The University of Chicago Press.; Schröter, forthcomingSchröter, Julian. Forthcoming. “Machine-Learning as a Measure of the Conceptual Looseness of Disordered Genres: Studies on German Novellen.” In Digital Stylistics in Romance Studies and Beyond, edited by Robert Hesselbach, José Calvo Tello, Ulrike Henny-Krahmer, Christof Schöch, and Daniel Schlör. Heidelberg: Heidelberg University Publishing.).
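The projection onto a principal component described in note 88 can be illustrated with a minimal sketch in plain Python. It approximates only the first principal component by power iteration, which is a simplification swapped in for the example; library implementations such as scikit-learn's `PCA` typically rely on a singular value decomposition. The function name `first_principal_component` and the toy feature rows are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Approximate the first principal component by power iteration.

    Returns the unit-length component and each row's score on it.
    Assumes the start vector is not orthogonal to the component.
    """
    n, d = len(rows), len(rows[0])
    # Center every feature on its mean.
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the centered rows onto the component.
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores


# Hypothetical toy data: four texts described by three features.
rows = [(2.0, 1.9, 0.5), (4.0, 4.2, 0.4), (6.0, 5.8, 0.6), (8.0, 8.1, 0.5)]
component, scores = first_principal_component(rows)
```

Each text is thereby reduced from three feature values to a single score, and the scores vary more than any single original feature does. Note that the scores place the texts along a continuous axis rather than assigning them to categories.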
90. For a general introduction to concepts of style and stylistics, mainly from a linguistic perspective, see Eroms (2008Eroms, Hans-Werner. 2008. Stil und Stilistik: eine Einführung. Berlin: Schmidt.). In their handbook on rhetorics and stylistics, Fix, Gardt, and Knape (2008Fix, Ulla, Andreas Gardt, and Joachim Knape, eds. 2008. Rhetoric and Stylistics. An International Handbook of Historical and Systematic Research. 2 vols. Berlin, New York: De Gruyter.) give a comprehensive overview of research on style, addressing a broader spectrum of humanities disciplines. An introduction focusing on style in fiction is Leech and Short (2007Leech, Geoffrey, and Mick Short. 2007. Style in Fiction. A Linguistic Introduction to English Fictional Prose. 2nd ed. Harlow, England: Pearson Education Limited.).
91. They focus on definitions at the textual level, including the pragmatic dimension, but do not take into account psychological and cognitive-linguistic theories.
92. The field of “literary” features is understood here as including aspects of the texts that are constitutive, typical, or relevant for them according to literary theory. Usually, the linguistic level is an intermediary between the literary features and their formal expression on the textual surface. Literary features are thus more difficult to formalize than linguistic features because their expression in linguistic surface features is not necessarily straightforward. Regarding the question of what constitutes a literary and what a linguistic feature, there is, of course, some area of overlap. Rhetorical figures, for example, can be conceived as elements characteristic of literary style, but their definition is often directly based on linguistic concepts. For genre analyses, specifically literary features are relevant because in definitions of literary genres, usually literary concepts are used and not linguistic ones. So if the results of digital genre stylistics should be linked to literary theoretical discussions of genre, the question of how to formalize literary features and how to link them to surface text style is a prerequisite.
93. The results of the metadata analysis are taken into account, though, because they provide relevant information about the general status of the subgenres, for example, how often they were explicitly mentioned in subtitles or the narrative perspective in which the novels are written. The first aspect describes them in terms of genre conventions, and the second one can be considered a stylistic trait that characterizes or subdivides individual subgenres.
94. See chapter 2.1.3.3 above on the status of literary currents.
95. For the romantic novel in Spanish America, see Suárez-Murias (1963Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.). The Mexican realist novel is treated in Navarro (1955Navarro, Joaquina. 1955. La novela realista mexicana. México: Compañía General de Ediciones.). The naturalistic novel in Spanish America is covered by Prendes (2003Prendes, Manuel. 2003. La novela naturalista hispanoamericana. Evolución y direcciones de un proceso narrativo. Madrid: Ediciones Cátedra.) and Schlickers (2003Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.). For the Argentine naturalistic novel see Gnutzmann (1998Gnutzmann, Rita. 1998. La novela naturalista en Argentina (1880–1900). Amsterdam, Atlanta: Rodopi.).
96. A monograph about the nineteenth-century Mexican historical novel was published by Read (1939Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.). The Latin-American sentimental novel in general (not limited to the nineteenth century) is covered by Zó (2015Zó, Ramiro Esteban. 2015. Emociones escriturales. La novela sentimental latinoamericana. Saarbrücken: Editorial Académica Española.).
97. For details about the criteria used to select novels for the bibliography and corpus, see chapter 3.1. The role of the novels in the process of nation-building is, for instance, discussed in Brushwood (1966Brushwood, John S. 1966. Mexico in its Novel. A Nation’s Search for Identity. Austin: University of Texas Press.) for Mexico, Ferrer (2018Ferrer, José Luis. 2018. La invención de Cuba: Novela y nación (1837–1846). Madrid: Editorial Verbum.) for the Cuban context, and Sommer (1993Sommer, Doris. 1993. Foundational Fictions. The National Romances of Latin America. Berkeley: University of California Press.) for Latin America as a whole.
98. See an overview of the number of novels per decade in chapter 4.1.3. The number of works that were recorded for the bibliography increased from around 20 works that were first published in the 1840s to over 80 works that were published in the 1870s. From the 1870s to the 1880s, the number doubled to about 180 works and remained on that level in the following decades.
99. For a comprehensive overview of content-related subgenres of the Spanish-American novel, see Sánchez (1953Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.).
100. First experiments with the corpus of nineteenth-century Spanish-American novels have been conducted by the author of this study in cooperation with members of the CLiGS project. They have been presented at German and international DH conferences. For a prototype analysis based on MFW and topics and a classification of subgenres with sentiment features, see Henny-Krahmer et al. (2018Henny-Krahmer, Ulrike, Katrin Betz, Daniel Schlör, and Andreas Hotho. 2018. “Alternative Gattungstheorien. Das Prototypenmodell am Beispiel hispanoamerikanischer Romane.” In DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts, Köln, 26.2.-2.3.2018, edited by Georg Vogeler, 105–112. Köln: Universität zu Köln. https://doi.org/10.5281/zenodo.4622412.) and Henny-Krahmer (2018Henny-Krahmer, Ulrike. 2018. “Exploration of Sentiments and Genre in Spanish American Novels.” In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Mexico City, 26–29 June 2018, 399–403. Mexico City: Red de Humanidades Digitales. https://web.archive.org/web/20200702225303/https://dh2018.adho.org/exploration-of-sentiments-and-genre-in-spanish-american-novels/.), respectively.
101. See chapter 4.1.5.3.1 for an overview of the proportions of thematic subgenres in the bibliography and the corpus and chapter 4.1.5.1 for a list of the most frequent explicit subgenre labels.
102. Juan Manuel de Rosas (1793–1877) was a governor of the province of Buenos Aires who established a dictatorial system marked by repressive measures that lasted between 1829 and 1852 and that enforced a political and economic hegemony of Buenos Aires over the other provinces.
103. Read (1939, 260Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.) calls it “the best historical novel of the nineteenth century.”
104. In the Mexican case, a different view on the costumbrista tradition sees the origin of the novels of customs not in Spanish models but in the early works of the Mexican author Fernández de Lizardi (Calderón 2005, 316–317Calderón, Mario. 2005. “La novela costumbrista mexicana.” In La república de las letras. Asomos a la cultura escrita del México decimonónico, edited by Belem Clark de Lara and Elisa Speckman Guerra, 315–324. Vol. 1: Ambientes, asociaciones y grupos. Movimientos, temas y géneros literarios. México: Universidad Nacional Autónoma de México.).
105. “Novela de costumbres” is the second most frequent explicit label for the thematic subgenres. Independently of the explicit label, the novels of customs rank sixth among the most frequent primary thematic subgenres in the bibliography and third in the corpus. See chapters 4.1.5.1 and 4.1.5.3.1 for the corresponding overviews.
106. See the overview of thematic subgenres by narrative perspective in chapter 4.1.5.3.1.
107. Dill, for example, structures the chapter on the romantic novel into the subsections “Der politische Roman”, “Der historische Roman”, “Der indianistische Roman”, “Der kubanische negristische Roman”, and “Der sentimentale Roman”. The chapter on the realist novel is not subdivided, and the one on the naturalistic novel only has a subchapter concerned with the city novel (“Der Großstadtroman”) as a special type of naturalistic novel (Dill 1999, 125–139, 159–166, 168–176Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.). In the introduction to her book on the Spanish-American romantic novel, Suárez-Murias lists the sentimental novel, the indianist novel, the historical novel, the costumbrista novel, the Roman à thèse (novela de tesis), and the dime novel (novela de folletín) as subgenres of the romantic current (Suárez-Murias 1963, 12–13Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.). Gálvez, who studies the Spanish-American novel up to 1940, structures the chapter on the novel of the romantic period into subchapters on the historical and the political, the indianist, and the sentimental novel. She dedicates another subchapter to the novel of the transition to realism. That chapter includes parts on the historical, social, and costumbrista novel, the latter including the gaucho, indio, and antislavery novel. As she concentrates on the novel alone, Gálvez’s account is more differentiated than that of Dill. She takes up several subgenres again in the chapters on the later realist, modernist, and regionalist currents, for example, the historical novel and the novel of customs, showing that they did not cease to exist but continued to be practiced under the influence of different literary currents (Gálvez 1990Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.).
Nevertheless, the later currents did not produce the same range of new, own, distinguishable, and widely recognized thematic subgenres as the romantic current did.
108. On the simultaneous presence of romantic and realist elements, see Lichtblau (1959, 66Lichtblau, Myron I. 1959. The Argentine Novel in the Nineteenth Century. New York: Hispanic Institute in the United States.) and Varela Jácome ([1982] 2000, sec. 2Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.).
109. For the defining characteristics and topics of the realist novel, see also Navarro (1955, 20–24Navarro, Joaquina. 1955. La novela realista mexicana. México: Compañía General de Ediciones.).
110. In literary-historical accounts of the nineteenth-century Spanish-American novel, some works are described as transitional between the romantic and realist currents, for example, “La Calandria” (1890, MX) by Rafael Delgado, or the historical novels of the Mexican writer Juan Antonio Mateos (Varela Jácome [1982] 2000, sec. 2.1.3Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.; Gálvez 1990, 96, 105Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.). Furthermore, literary historians come to different conclusions regarding the status of some works in relationship to literary currents. The novel “Cecilia Valdés o La Loma del Ángel” (1839/1882, CU) by Villaverde, for instance, is described as primarily romantic with realistic elements by Suárez-Murias and Varela Jácome, as realist by Dill, and as transitional between both currents by Gálvez (Dill 1999, 160–161Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.; Gálvez 1990, 115–117Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.; Suárez-Murias 1963, 36–40Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.; Varela Jácome [1982] 2000, sec. 1.3Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.).
111. The great number of naturalistic novels that were written in Spanish America becomes evident in the comprehensive study that Schlickers published on the Spanish-American naturalistic novel. She includes 63 novels in her book and discusses almost every novel in its own chapter (Schlickers 2003Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.).
112. On the difficulty of delimiting the terms and concepts of costumbrismo, realismo, regionalismo, and naturalismo, see Navarro (1955, 12–19Navarro, Joaquina. 1955. La novela realista mexicana. México: Compañía General de Ediciones.). Sánchez, for instance, only uses the term “novela naturalista” for works of both the realist and naturalistic types (Sánchez 1953, 257–259Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.).
113. For example, the concentration on one principal character in the Bildungsroman versus a broad picture with several important characters in historical novels (Fludernik 2009, 628Fludernik, Monika. 2009. “Roman.” In Handbuch der literarischen Gattungen, edited by Dieter Lamping and Sandra Poppe, 627–645. Stuttgart: Kröner.).
114. Up to the seventeenth century, for example, works in verse form could also be called Romane (novels) (Steinecke 2007, 317Steinecke, Hartmut. 2007. “Roman.” In Reallexikon der deutschen Literaturwissenschaft, edited by Klaus Weimar, Harald Fricke, and Jan-Dirk Müller, 317–323. Berlin, New York: De Gruyter.).
115. Nevertheless, earlier attempts to define fictionality that focused on reference semantics took this line. According to these views, fictional and factual texts are characterized by a respective specific mode of referentiality. Whereas a factual text could be defined as a text that references the empirical reality directly and is conceived and perceived as such by the author and its readers, a fictional text predominantly references invented places, characters, or events. Elements existing in the real world can also be part of a fictional text, but there should be elements in a fictional text that do not have any counterpart in reality. This means that these elements cannot be referred to anything existing before and outside of their linguistic formulation and creation in the text. However, it has been shown that the assumption of different modes of referentiality for fictional and factual texts is problematic, at least in the outlined narrow conception of the term (Weidacher 2017, 375–378Weidacher, Georg. 2017. “Fiktionalität und Fiktionalitätssignale.” In Handbuch Sprache in der Literatur, edited by Anne Betten, Ulla Fix, and Berbeli Wanning, 373–390. Berlin, New York: De Gruyter.).
116. The peritext includes paratexts that are published as part of a work, e.g., prefaces and dedications, whereas the epitext involves paratexts that are published outside of the immediate context of the work (Genette 1987Genette, Gérard. 1987. Seuils. Paris: Seuil.).
117. For some of the works listed in bibliographies of the Spanish-American novel in the nineteenth century, no edition could be found either in WorldCat (a union catalog currently containing more than 430 million bibliographic entries from libraries worldwide, see OCLC 2023OCLC. 2023. “Inside WorldCat.” https://web.archive.org/web/20230325164614/https://www.oclc.org/en/worldcat/inside-worldcat.html.) or in individual relevant library catalogs. Even if editions could be located, it was not feasible in every case to consult them, especially if there were only print editions distributed over different American libraries. The issue of accessibility of the texts will be discussed in more detail in chapter 3.3.1 (“Selection of Novels and Sources”).
118. In the portal “Novela hispanoamericana del siglo XIX”, which is part of the “Biblioteca Virtual Miguel de Cervantes”, for example, “Facundo” is classified as “novela histórica argentina” but also as a biography of Juan Facundo Quiroga and as an autobiography of the author Sarmiento (Sarmiento [1845] 2000Sarmiento, Domingo Faustino. (1845) 2000. Vida de Juan Facundo Quiroga (en formato HTML). Edited by Benito Varela Jácome. Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmc18359.).
119. Other similar references are made to the authors Alexander von Humboldt, Jacques Bernardin Henri de Saint-Pierre, William Wordsworth, Gregorio Gutiérrez González, François-René de Chateaubriand, Henry Wadsworth Longfellow, Domingo Faustino Sarmiento, and Olegario Víctor Andrade (González [1905] 2001, XII, XXGonzález, Joaquín Víctor. (1905) 2001. Mis montañas (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmcw37r4.).
120. The script that produced the box plot is available at https://github.com/cligs/scripts-nh/blob/master/corpus/direct-speech-travelogues.xsl and the result at https://github.com/cligs/data-nh/blob/master/corpus/direct-speech-travelogues.html. Accessed January 24, 2020. The group of novels that the travelogues were compared to is a subset of the whole corpus consisting of 92 novels in which direct speech has been marked up.
121. The presence of narrative, subjective, and fictitious elements in travelogues has a long tradition in Spanish-American writings, going back to some chronicles of the Conquista (Anderson Imbert 1995, 17–48Anderson Imbert, Enrique. 1995. Historia de la literatura hispanoamericana. Vol. 1: La colonia. Cien años de república. 2nd ed. México: Fondo de Cultura Económica.). This contributes to the literary character of these works but does not justify considering them plain fiction. The generic ambiguity of the three travel narratives discussed here is also evidenced by their inclusion or exclusion in other text collections and bibliographies. In the “Biblioteca Virtual Miguel de Cervantes”, for example, all three texts are part of the portal “Novela hispanoamericana del siglo XIX”. While “La tierra natal” and “Mis montañas” are also classified as novels in the general keyword system of the virtual library, “Una excursión a los indios ranqueles” is not. It is labeled with “Descripciones y viajes” (Gorriti [1889] 2001Gorriti, Juana Manuela. (1889) 2001. La tierra natal (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmc222t4.; González [1905] 2001González, Joaquín Víctor. (1905) 2001. Mis montañas (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmcw37r4.; Mansilla [1870] 2001Mansilla, Lucio Victorio. (1870) 2001. Una excursión a los indios ranqueles. Tomo Primero (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmcn8760.). In the bibliography of the Argentine novel authored by Lichtblau, “Una excursión a los indios ranqueles” is included as a borderline case, while “La tierra natal” and “Mis montañas” are not mentioned (Lichtblau 1997Lichtblau, Myron. 1997. The Argentine novel: an annotated bibliography. Lanham, Maryland: Scarecrow.).
122. The following clarifications of the definition of narration are a summary of Weber’s more detailed explanations.
123. He is explicit in two aspects, though: What he considers an Argentine novel and how he distinguishes novels and short novels from short stories (Lichtblau 1997, XV–XVILichtblau, Myron. 1997. The Argentine novel: an annotated bibliography. Lanham, Maryland: Scarecrow.).
124. Sometimes novels have also been versified by other authors, for instance, the novels of the Argentine Eduardo Gutiérrez, which have been reworked by Bartolomé R. Aprile, Silverio Manco, and Apolinario Sierra. See, for example, Gutiérrez and Aprile ([1944] 2015Gutiérrez, Eduardo, and Bartolomé R. Aprile. (1944) 2015. Juan Cuello. Novela histórica de Eduardo Gutiérrez, versificada por Bartolomé R. Aprile. Berlin: Ibero-Amerikanisches Institut – Preußischer Kulturbesitz. http://resolver.iai.spk-berlin.de/IAI00005CB300000000.), Gutiérrez and Manco ([1948] 2015Gutiérrez, Eduardo, and Silverio Manco. (1948) 2015. El rastreador. Novela histórica de Eduardo Gutiérrez, versificada por Silverio Manco. Berlin: Ibero-Amerikanisches Institut – Preußischer Kulturbesitz. http://resolver.iai.spk-berlin.de/IAI00005C9700000000.), and Gutiérrez and Sierra ([1944] 2015Gutiérrez, Eduardo, and Apolinario Sierra. (1944) 2015. Aparicio Saravia. Novela histórica por Eduardo Gutiérrez, versificada por Apolinario Sierra. Berlin: Ibero-Amerikanisches Institut – Preußischer Kulturbesitz. http://resolver.iai.spk-berlin.de/IAI00005CB200000000.).
125. Especially editions with a layout in two columns lead to more words per page than single-column layouts. Some of Eduardo Gutiérrez’ novels were published with a two-column layout. See, for instance, Gutiérrez ([1893] 2016Gutiérrez, Eduardo. (1893) 2016. Carlo Lanza. Episodios curiosos. Wikimedia Commons. https://web.archive.org/web/20230325192637/https://upload.wikimedia.org/wikipedia/commons/e/e0/Carlo_Lanza_-_Eduardo_Gutierrez.pdf.); Gutiérrez ([1880] 2016aGutiérrez, Eduardo. (1880) 2016a. El Jorobado. Wikimedia Commons. https://web.archive.org/web/20230325193326/https://upload.wikimedia.org/wikipedia/commons/c/c3/El_Jorobado_-_Eduardo_Gutierrez.pdf.); Gutiérrez ([1880] 2016bGutiérrez, Eduardo. (1880) 2016b. Juan Moreira. Wikimedia Commons. https://web.archive.org/web/20230325193551/https://upload.wikimedia.org/wikipedia/commons/2/21/Juan_Moreira_-_Eduardo_Gutierrez.pdf.).
126. Fludernik traces the history of the novel from early modern precursors up to the twentieth century and states that the novel is a European genre which spread internationally in particular in the nineteenth and twentieth centuries. Nevertheless, the novel as a genre was not unknown in the Spanish-American colonies in earlier centuries. The circulation and reception of European novels in the colonies and the existence of precursors of the Spanish-American novel are set out, for example, in Sánchez (1953, 67–127Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.). See also Lindstrom (2004, 47–77Lindstrom, Naomi. 2004. Early Spanish American Narrative. Austin: University of Texas Press.).
127. For example, the short novels written by José Joaquín Pesado, Ignacio Rodríguez Galván, Ramón de Palma y Romay, Félix Tanco y Bosmeniel, and Juana Manuela Gorriti.
128. For instance, the historical novels of Ireneo Paz and Juan Antonio Mateos, the realist novels of Carlos María Ocantos, and the naturalistic novels of Eugenio Cambaceres and Federico Gamboa.
129. Like the modernist novels written by Amado Nervo and Efrén Rebolledo, for example.
130. There are exceptions here, too. “La loca de la guardia” (1896, AR) by Vicente Fidel López has the subtitle “Cuento histórico” and is a work published as a book with almost 500 pages. There are also some texts of intermediate length which are called “cuento”, for example, “María, la perla de la Diaria. Cuento cubano” (1866, CU) by Rafael Otero, published independently with 119 pages, and “El hogar en la pampa (Cuento)” (1866, AR) by Santiago Estrada with 133 pages. They can all be considered novels.
131. Mata mentions a whole range of denominations for the short novel in nineteenth-century Mexico.
132. For example, “María del Consuelo. Novela” (1894, MX) by Alberto Leduc with 39 pages, “Comunidad de nombres y apellidos. Novela original” (1845, CU), and “Teresa. Novela original” (1839, CU) by Cirilo Villaverde with 63 and 93 pages, or “Un ángel y un demonio, o el valor de un juramento (Novela original)” (1857, AR) by Margarita Ochagavia with 104 pages.
133. The “Panoramas de la vida. Colección de novelas, fantasías, leyendas y descripciones americanas” (1876, AR) by Juana Manuela Gorriti, “Tardes nubladas. Colección de novelas” (1871, MX) by Manuel Payno, and “Mesa revuelta. Colección de artículos de amena literatura, opúsculos, juicios críticos, historietas, novelas, folletines, revistas viejas y otras muchas cosas” (1860, CU) by Francisco Calcagno, for instance.
134. Assessing materials for the bibliographic database, the following works and collections with the subtitle “novela(s) corta(s)” were found: “La manigua sentimental. Novela corta” (1910, CU) by Jesús Castellanos, “Otras vidas. Novelas cortas” (1909, MX) by Amado Nervo, “Gil Luna, artista. Novelas cortas” (1908, CU) by Luis Rodríguez Émbil, “El enemigo. Novela corta” (1908, MX) by Efrén Rebolledo, “Thespis (Novelas cortas y cuentos)” (1907, AR) by Carlos Octavio Bunge, “Voces perdidas (Novelas cortas y cuentos)” (c. 1907, AR) by Jorge Lavalle Cobo, “Sucesos y novelas cortas” (1903, MX) by José López-Portillo y Rojas, “Novelas cortas de varios autores” (1901, MX) – a compilation of earlier short novels – and “La capilla de los álamos. Colección de novelas cortas” (1892, MX) by Manuel Covarrubias y Acevedo. Furthermore, there were volumes of collected works entitled “novelas cortas”: “Obras del Sr. D. J. María Roa Bárcena. Novelas cortas” (1910, MX), “Obras de Don Florencio M. del Castillo. Novelas cortas” (1902, MX), “Obras de Don Manuel Payno. Novelas cortas” (1901, MX), “Obras del Lic. D. J. López-Portillo y Rojas. Novelas cortas” (1900, MX). They were all published in the late nineteenth and early twentieth century. An exception is the earlier collection “Horas de tristeza. Novelas cortas” (1849, MX) by Florencio M. del Castillo.
135. There are other approaches to the Mexican short novel besides Mata’s. In particular, the portal “La novela corta. Una biblioteca virtual” has been developed by a research project hosted at the Universidad Nacional Autónoma de México (2008–2023Universidad Nacional Autónoma de México. 2008–2023. “La Novela Corta. Una biblioteca virtual.” https://web.archive.org/web/20230328173719/https://www.lanovelacorta.com/.). The portal is accompanied by critical approaches to the short novel published in five volumes, among them Chaves (2011Chaves, José Ricardo. 2011. “Huellas y enigmas de la novela corta en el siglo XIX.” In Una selva infinita. La novela corta en México (1872–2011), edited by Gustavo Jiménez Aguirre, Gabriel M. Enríquez Hernández, Esther Martínez Luna, Salvador Tovar Mendoza, and Raquel Velasco, 109–127. México: Fundación para las Letras Mexicanas.). Like Mata, Chaves (115–119Chaves, José Ricardo. 2011. “Huellas y enigmas de la novela corta en el siglo XIX.” In Una selva infinita. La novela corta en México (1872–2011), edited by Gustavo Jiménez Aguirre, Gabriel M. Enríquez Hernández, Esther Martínez Luna, Salvador Tovar Mendoza, and Raquel Velasco, 109–127. México: Fundación para las Letras Mexicanas.) links the history of the short novel in Mexico to European traditions (German, French, and English). In a compilation of Mexican romantic short novels, Ruedas de la Serna concludes: “Sin embargo su interés radica precisamente en que fueron los primeros ensayos narrativos de nuestros escritores en que surge una clara conciencia de la expresión literaria. Cierto que estos avanzaban penosamente en el dominio de esta nueva técnica de representación de la realidad, de la que, como de tantas otras cosas, se nos había privado.
Cuánto, sin embargo, no habrían contribuido estas obritas en la batalla de nuestros intelectuales del siglo pasado por transformar su sociedad, y cuánto no deben a estas primicias los novelistas posteriores” (Cárabes and Ruedas de la Serna 1998, 71–72Cárabes, Celia Miranda, and Jorge A. Ruedas de la Serna. 1998. La novela corta en el primer romanticismo mexicano. 2nd ed. Mexico: Universidad Nacional Autónoma de México.), thus evaluating the early short novels as first narrative attempts, the view preferred here. For Argentina and Cuba, no comprehensive studies of the short novel in the nineteenth century could be found.
136. In accounts of the nineteenth-century Spanish-American novel, shorter novels are often included, especially the early ones published from the 1830s onwards, which are mentioned as first novels of their kind, for example, “Netzula” (1837, MX) by José María Lacunza as the first indianist novel (Brushwood 1966, 71Brushwood, John S. 1966. Mexico in its Novel. A Nation’s Search for Identity. Austin: University of Texas Press.; Sánchez 1953, 546Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.; Varela Jácome [1982] 2000, 4Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.). Suárez-Murias mentions various “novelitas” when tracing the development of the Cuban romantic novel (Suárez-Murias 1963, 22–27Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.). Molina includes several short novels in her book about the early Argentine nineteenth-century novel (Molina 2011, 405–489Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.).
137. For example, in the title “Fru Jenny. Seis novelas danesas” (1915, AR) by Carlos María Ocantos, a cycle of six novellas that share the same geographic setting and are published together in one volume. This work is out of the scope of this dissertation anyway because of its publication after 1910.
138. The quality of texts that have been digitized using OCR without being corrected afterward is usually not sufficient. See, for example, the full-text versions of texts uploaded to the Internet Archive (e.g., Ramírez [1868] 2008Ramírez, José María. (1868) 2008. “Full text of ‘Una rosa y un harapo: Novela original’.” Internet Archive. https://web.archive.org/web/20230328190153/https://archive.org/stream/unarosayunharap00ramgoog/unarosayunharap00ramgoog_djvu.txt.).
139. tokens = re.split(r"\W+", text, flags=re.MULTILINE)
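The tokenization in note 139 can be tried out with the following minimal sketch. The function name `count_words` and the filtering of empty strings are assumptions added for illustration and are not taken from the original scripts. `re.split` returns an empty string when the text begins or ends with a non-word character, so a raw `len(tokens)` may overcount slightly. The `re.MULTILINE` flag is kept as in the note, although it only affects the anchors `^` and `$` and therefore does not change the result for this pattern.

```python
import re


def count_words(text: str) -> int:
    """Count word tokens by splitting on runs of non-word characters.

    Hypothetical helper: the expression is the one quoted in note 139,
    the removal of empty strings is an assumption made for this sketch.
    """
    tokens = re.split(r"\W+", text, flags=re.MULTILINE)
    # Drop the empty strings produced at the start or end of the text.
    return len([token for token in tokens if token])


sample = "Hola, mundo. ¿Qué tal?"
print(count_words(sample))
```

In Python 3, `\W` is Unicode-aware for string patterns, so accented Spanish letters such as “é” count as word characters and do not split a word.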
140. If a page turned out to be a blank page or a page containing an index or only an image, it was replaced by the next (or preceding) regular text page because these are considered exceptional page types. On the other hand, chapter beginnings and endings with fewer words than full text pages were kept because they appear regularly in the novels to a certain extent. Parts of words occurring at the beginning or the end of a page were counted as whole words.
141. The script used to generate the list of random pages and the box plot is available at https://github.com/cligs/scripts-nh/blob/master/corpus/words-per-page.py. Lists of the random pages and the corresponding editions of novels can be accessed at https://github.com/cligs/data-nh/blob/master/corpus/random-pages.csv and https://github.com/cligs/data-nh/blob/master/corpus/pages-novels.xml, respectively. The full texts of the selected pages are collected in the file https://github.com/cligs/data-nh/blob/master/corpus/pages-text.xml, and the resulting box plot file can be found at https://github.com/cligs/data-nh/blob/master/corpus/words-per-page.html. Accessed January 25, 2020.
142. The data used here were a preliminary corpus and bibliography, which were successively refined. The works chosen were written by Argentine, Cuban, and Mexican authors (see chapter 3.1.2 for an explanation of how the country to which an author belongs was determined) and published between 1830 and 1910 (see chapter 3.1.3 explaining the chronological limits used here). See chapters 3.2.1 and 3.3.1 for details about the sources of the full-text corpus and the bibliographic database, respectively. All bibliographic references of novels missing page numbers were left out. Furthermore, works contained in the following collections were not considered for the calculation of the typical length of a novel: “Mesa revuelta. Colección de artículos de amena literatura, opúsculos, juicios críticos, historietas, novelas, folletines, revistas viejas y otras muchas cosas” (1860, CU) by Francisco Calcagno, “Leyendas, novelas y artículos literarios” (1877, CU) by Gertrudis Gómez de Avellaneda, and “Panoramas de la vida. Colección de novelas, fantasías, leyendas y descripciones” (1876, AR) by Juana Manuela Gorriti because it is unclear which of the genres mentioned in the titles apply to which texts; “Horas de tristeza. Colección de novelas” (1850, MX) by Florencio María del Castillo and “Tardes nubladas. Colección de novelas” (1871, MX) by Manuel Payno because they were published in other editions with the title “Novelas cortas”; and “Novelas en germen” (1900, MX) by Emilio Bobadilla because the texts are all shorter than 50 pages and the title “Novelas en germen” can be interpreted as designating short novels.
143. Obviously every work can have multiple editions. In the case of several different editions, the mean of their respective number of pages was used to balance out differences regarding the number of words per page. A work with several editions was considered eligible as a novel if at least one of the various editions published between 1830 and 1910 carried the label “novela”. For the full texts, only one available edition was used to count the words. Another factor of uncertainty when using page numbers of bibliographic entries is that they usually refer to the pagination of the books and not to the number of pages of the work, so prefaces, indexes, appendices, etc. might be included, which means that the works themselves are possibly shorter than calculated here.
144. The script that was used to select the works from the preliminary corpus and bibliography, to calculate the numbers of pages and words, and to create the box plots for figures 4, 5, 6, and 7 is available at https://github.com/cligs/scripts-nh/blob/master/corpus/words-novelas.xsl. The corresponding data and results can be viewed at https://github.com/cligs/data-nh/tree/master/corpus/words-novelas. A list of all the works that were used for the calculation of the word (and page) limit is given at https://github.com/cligs/data-nh/blob/master/corpus/words-novelas/novelas-length.csv. Accessed January 27, 2020.
145. A right-skewed distribution is one with many low and few high values and with a mean that is higher than the median.
146. The medians and quartiles are rounded to the nearest thousand. For the full texts alone, the median is at 61,000 words, and for the bibliographic entries alone, at 36,000 words. That the numbers are higher for the full texts is probably due to the selection of the texts: many of the digitized nineteenth-century novels available in full-text format are long novels, which tend to be considered paradigmatic. Furthermore, many historical novels were chosen for the corpus, and these have a tendency to be longer than novels of other subgenres. In addition, very short novels were avoided in the collection of the full texts, but in the bibliographic entries, they were included.
147. For example, the novels “Dos niñas hechiceras (novela original)” (1874, AR), published independently under the pseudonym “Guindilla”, and “Una sanjuanina, o sea Carolina. Novela de costumbres” (1881, MX) by Guillermo Quiroga, which are both only 18 pages long.
148. The word length for the “novelas cortas” was determined in the same way as for the “novelas”. Three works were available as full texts, so their words were counted with a regular expression. For the other 62 works, the number of pages was converted to a number of words using the mean number of words per page calculated above.
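The conversion from pages to words described in notes 143 and 148 can be sketched as follows. This is an illustration under stated assumptions, not the XSLT script referenced in note 144. The function name `estimated_words` and the value of 200 words per page are hypothetical placeholders, and the actual mean number of words per page has to be taken from the analysis of the random pages.

```python
from statistics import mean


def estimated_words(page_counts: list[int], words_per_page: float) -> float:
    """Estimate the length of a work in words from its page counts.

    If a work has several editions, the mean of their page counts is
    used, as described in note 143, before converting pages to words.
    """
    return mean(page_counts) * words_per_page


# Hypothetical value, NOT the figure determined in the dissertation.
WORDS_PER_PAGE = 200

# A work with two editions of 100 and 120 pages (invented numbers).
print(estimated_words([100, 120], WORDS_PER_PAGE))
```

Averaging over editions before the conversion keeps a single densely printed edition, for example one set in two columns, from dominating the estimate.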
149. The values were rounded to the nearest hundred. The upper fence defines a limit between values that can still be considered typical in a distribution and those that can be considered outliers. It is set at the largest sample that is larger than the third quartile (Q3) but still lower than Q3 + 1.5 * IQR (the interquartile range, which is Q3 - Q1).
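The definition of the upper fence in note 149 can be made concrete with a short sketch. It assumes the “inclusive” quartile method of Python's `statistics.quantiles`. Plotting libraries compute quartiles in slightly different ways, so the values produced by the original script may differ. The function name `upper_fence` and the sample values are invented for illustration.

```python
from statistics import quantiles


def upper_fence(values: list[float]) -> float:
    """Return the largest value that does not exceed Q3 + 1.5 * IQR."""
    # Quartile method is an assumption; other tools may interpolate differently.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    limit = q3 + 1.5 * (q3 - q1)
    # The fence is an observed data point, not the theoretical limit itself.
    return max(value for value in values if value <= limit)


# Invented sample: Q1 = 3, Q3 = 7, IQR = 4, limit = 13, so 100 is an outlier.
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(upper_fence(lengths))
```

In this invented sample, the value 100 lies above the limit of 13 and counts as an outlier, so the fence falls on 8, the largest remaining value.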
150. Rounded from 16,044.
151. Rounded from 83.8. Interestingly, this limit is very close to the 80 pages assumed by Fludernik (2009, 632Fludernik, Monika. 2009. “Roman.” In Handbuch der literarischen Gattungen, edited by Dieter Lamping and Sandra Poppe, 627–645. Stuttgart: Kröner.).
152. The number of words in the full texts was rounded to the next thousand before the limit was applied.
153. Two forms of serial publication were common: the novela por entregas, where parts of the novel were delivered loosely, as a booklet accompanying a newspaper, or included in a literary magazine, and the novela de folletín, where the novel was published subsequently in specific columns of a daily newspaper (Molina 2011, 27Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.; Villegas Cedillo 1984, 12–15Villegas Cedillo, Alberto. 1984. La novela popular mexicana en el siglo XIX. San Nicolás de los Garza: Universidad Autónoma de Nuevo León.).
154. In spite of this, a bias towards the selection of the more canonized works can hardly be avoided because they are the ones that are better transmitted and more often digitized, especially as full texts. Moreover, bibliographies of the novel refer mainly to monographic publications. See chapters 3.2.1 and 3.3.1 on the selection of texts for the bibliography and the corpus for details.
155. In this case, it is strange that the third part has a later publication date than the fourth part. There must be an earlier edition of “Don Manuel de Paloche”, but this could not be verified.
156. Gutiérrez was a very productive writer who wrote 34 novels, many of them organized in cycles. Besides the “Dramas militares”, he also wrote “Dramas cómicos”, “Dramas policiales”, and “Dramas del terror”.
157. Ocantos wrote a whole series of 20 novels called “Novelas argentinas”, published between 1888 and 1929, to which “Entre dos luces” and “El candidato” also belong (Ianes 2018, 19Ianes, Raúl. 2018. “La ficción de la lengua en las Novelas argentinas de Carlos María Ocantos: una lectura histórica.” Decimonónica 15 (2): 14–28. https://web.archive.org/web/20230328191041/https://www.decimononica.org/wp-content/uploads/2018/07/Ianes_15.2.pdf.).
158. If content-related criteria were considered, they could be: Is the set of main characters identical in the different parts? Is the setting the same? Is the plot a direct continuation or predecessor of another part’s plot? The relationships between the various parts of “Libro extraño”, for example, are discussed by Gnutzmann (1998, 183–185Gnutzmann, Rita. 1998. La novela naturalista en Argentina (1880–1900). Amsterdam, Atlanta: Rodopi.).
159. That way, clearly unfinished and unpublished works are excluded (for example, “Beatriz” (MX) by Ignacio Manuel Altamirano) but works where only a self-contained part was realized and published or transmitted are included (for example, “Ambarina. Historia doméstica cubana. Tomo I” (1858, CU) by Virginia Felicia Auber de Noya).
160. For example, “El pozo del Yocci” (1876, AR) by Juana Manuela Gorriti, which was published as part of the collection “Panoramas de la vida”, or “Las ranas pidiendo rey. Confesiones de una afrancesada (1861–1862)” and “La corte de Maximiliano. Nuevas confesiones de una afrancesada (1863–1867)” (1903, MX) by Victoriano Salado Álvarez, published as parts of two different volumes of the “Episodios Nacionales Mexicanos”.
161. Usually, novels can be easily distinguished from collections of short stories regarding their structure because, typically, novels have numbered chapters. Cases that need a closer look are books containing narrative, fictional text, a title not mentioning the genre, and various parts with headings but without numbering because they could either be novels or collections of short stories. A special case in this regard is the work “Pago Chico” (1908, AR) by Roberto Payró. Originally, its parts were published individually and at different times in a literary magazine. In the monographic form, the parts are connected as numbered chapters. The work is characterized as follows by literary historians: “Pago Chico, loser Kranz von Erzählungen mit gemeinsamem Protagonisten, der Stadt Pago Chico” (Dill 1999, 210Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.); “Payró, der in der Provinzstadt Bahía Blanca selbst Opfer politischer Repression geworden war, zeichnete im Verlauf seiner journalistischen Karriere das satirische Bild dieses Systems sowohl in Artikeln als auch in einer Serie von Erzählungen, die 1908 und 1928 unter den Titeln Pago Chico und Nuevos cuentos de Pago Chico in Buchform zusammengefasst wurden” (Rössner 2007, 347–348Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.). Even though the monograph was published during the lifetime of the author and in the time frame of this study, this work is excluded from the bibliography and the corpus because it was not primarily conceived as a novel.
162. See chapter 3.1.3 (“Limits of the Nineteenth Century”) explaining the chronological limits used here.
163. E.g., the novel “El manantial” (1908, AR) by Emma de la Barra and the series of didactic and popular scientific novels “La ciencia recreativa” (1871–1879, MX) by Alberto F. Arriaga.
164. “Peregrinación de Luz del Día” is included in Lichtblau’s bibliography of the Argentine novel (Lichtblau 1997, 15–16Lichtblau, Myron. 1997. The Argentine novel: an annotated bibliography. Lanham, Maryland: Scarecrow.), whereas “Los dioses de la Pampa” is not, but it is labeled as a novel in the “Biblioteca Virtual Miguel de Cervantes” and in “Wikisource” (see Daireaux [1945] 2001Daireaux, Godofredo. (1945) 2001. Los dioses de la Pampa (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmcp55j4.; Daireaux 2006Daireaux, Godofredo. 2006. “Los dioses de la Pampa.” Wikisource. https://web.archive.org/web/20230328202403/https://es.wikisource.org/wiki/Los_dioses_de_la_Pampa_%28Versi%C3%B3n_para_imprimir%29.).
165. In that sense, it is synonymous with “Ibero-America”. In a broader understanding, “Latin America” can also include French-speaking Caribbean and South American countries or the whole geographical region south of the United States of America (Ardao 1980, 13–27Ardao, Arturo. 1980. Genesis de la idea y el nombre de América Latina. Caracas: Centro de Estudios Latinoamericanos Rómulo Gallegos.).
166. On Latin-American literature, e.g., Dill (1999Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.), Rössner (2007Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.), Smith (1997Smith, Verity, ed. 1997. Encyclopedia of Latin American Literature. Chicago, Illinois: Fitzroy Dearborn Publishers.), and Sommer (1993Sommer, Doris. 1993. Foundational Fictions. The National Romances of Latin America. Berkeley: University of California Press.). On Spanish-American literature, see, for example, Anderson Imbert (1954Anderson Imbert, Enrique. 1954. Historia de la literatura hispanoamericana. México: Fondo de Cultura Económica.), Janik (2008Janik, Dieter. 2008. Hispanoamerikanische Literaturen. Von der Unabhängigkeit bis zu den Avantgarden (1810–1930). Tübingen: Narr Francke Attempto.), and Zum Felde (1954Zum Felde, Alberto. 1954. Índice crítico de la literatura hispanoamericana. 2 vols. México: Editorial Guaranía.). On the Spanish-American novel, for instance, Alegría (1959Alegría, Fernando. 1959. Breve historia de la novela hispanoamericana. México: Ed. de Andrea.), Gálvez (1990Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.), Goić (1980Goić, Cedomil. 1980. Historia de la novela hispanoamericana. 2nd ed. Valparaiso: Ed. Univ. de Valparaiso.), Meléndez (1961Meléndez, Concha. 1961. La novela indianista en Hispanoamerica (1832–1889). Rio Piedras: Universidad de Puerto Rico.), Meyer-Minnemann (1979Meyer-Minnemann, Klaus. 1979. Der spanischamerikanische Roman des Fin de siècle. Tübingen: Niemeyer.), Phillips-López (1996Phillips-López, Dolores. 1996. La novela hispanoamericana del modernismo. Genève: Ed. Slatkine.), Sánchez (1953Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.), Schlickers (2003Schlickers, Sabine. 2003. 
El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.), Suárez-Murias (1963Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.), and Varela Jácome ([1982] 2000Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.). There are also academic journals dedicated to the literature of the region, e.g., the “Anales de Literatura Hispanoamericana” and the “Cuadernos de Literatura del Caribe e Hispanoamérica”.
167. See, for example, Rössner (2007, 130–199Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.). The literature from 1820 up to 1900 is presented in chapters on different regions: Mexico, Central America, the Caribbean, Colombia and Venezuela, the Andean countries, the Cono Sur, and Brazil. The Cono Sur designates the southern area of South America, comprising Chile, Argentina, Uruguay, and Paraguay. A book organized by country is Suárez-Murias (1963Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.). Examples of approaches presenting the developments in each chapter (e.g., on the historical novel or the naturalistic novel) by country are Dill (1999Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.) and Sánchez (1953Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.).
168. The same arguments – a common linguistic heritage and the need to overcome a Spanish past – are invoked by Rojas Mix as part of a first cultural Hispanoamericanismo in the spirit of Simón Bolívar while a later, second Hispanoamericanismo (the one expressed by the modernist writers) involves reconciliation between Spain and the Americas in favor of a common Hispanic identity (Rojas Mix 1987, 60–64Rojas Mix, Miguel. 1987. “La cultura hispanoamericana del siglo XIX.” In Historia de la Literatura Hispanoamericana, edited by Luis Íñigo Madrigal, 55–74. Vol. 2: Del neoclasicismo al modernismo. Madrid: Ediciones Cátedra.).
169. See chapter 4.1.5 below for overviews of the subgenres in the bibliography and the corpus.
170. Before deciding on the selection of the countries, several digital catalogs and libraries were checked to see if the number of novels would be enough to create a digital corpus of considerable size and suitable for quantitative analyses. The first search was performed in the WorldCat, a union catalog containing items of print and digital media alike (see OCLC 2001–2023OCLC. 2001–2023. “WorldCat.” https://www.worldcat.org/de. Accessed March 28, 2023.). Searching for items published between 1830 and 1910 with the keywords “novela” and the names of Spanish-American capitals gave the following results: México (926), Buenos Aires (352), Habana (240), Bogotá (121), Santiago de Chile (120), Lima (87), La Paz (61), Montevideo (61), Caracas (46), Guatemala (36), Quito (19), Asunción (0). All searches were last performed on October 21, 2019. In the advanced search of the WorldCat, there is no field for the place of publication, so the place names were entered as general keywords. This leads to some false positives in the results because the keyword might also be part of a title or of a name. In this and the following searches, of the Central American countries and capitals, only Guatemala was searched for because, in the other countries, the establishment of national literatures was thwarted by a long process of disintegration after the end of the Viceroyalty of New Spain (Rössner 2007, 149Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.). The second search was performed in the HathiTrust Digital Library (see HathiTrust 2008–2023HathiTrust. 2008–2023. “HathiTrust Digital Library.” https://www.hathitrust.org/. Accessed March 28, 2023.).
A search for catalog items including the word “novela” which were published between 1830 and 1910 in different Spanish-American capitals yielded the following numbers of results: México (178), Habana (72), Buenos Aires (66), Bogotá (31), Santiago de Chile (26), Montevideo (23), Caracas (18), Lima (15), La Paz (15), Guatemala (7), Quito (5), Asunción (0). In HathiTrust’s advanced search, the language and publication year can be searched explicitly, but the place of publication cannot. It was therefore added as a general search term. In the “Biblioteca Virtual Miguel de Cervantes” (see Centro Biblioteca Virtual Miguel de Cervantes 2023Centro Biblioteca Virtual Miguel de Cervantes. 2023. “Biblioteca Virtual Miguel de Cervantes.” Accessed March 28, 2023. http://www.cervantesvirtual.com/.), searches for “novela argentina”, “novela mexicana”, etc. and “Siglo 19°” resulted in: novela mexicana (49), novela argentina (42), novela colombiana (21), novela cubana (21), novela chilena (13), novela uruguaya (12), novela peruana (9), novela ecuatoriana (5), novela venezolana (3), novela boliviana (1), novela guatemalteca (1), novela paraguaya (0). A search for places of publication is not possible in the “Biblioteca Virtual Miguel de Cervantes”. A search for a range of publication dates is also hardly possible. In the advanced search, there is no specific search field for the year of publication. There are several subject areas related to chronology, but they overlap (e.g., “Narrativa argentina -- Siglo 19º”, “Novela argentina -- Siglo 19º”, “Novela histórica argentina -- Siglo 19º” where the latter are not necessarily contained in the former) and the search for the specification of the subject area (the part after “--”) does not work. Therefore the result lists were checked manually for nineteenth-century novels. 
Of course, these searches only approximate the number of novels published in the different countries, but they show that Argentina, Mexico, and Cuba were comparatively rich in novels in the nineteenth century, followed by Colombia.
171. There were several reasons for writers to publish their novels in other countries, for example, political exile or residence in another country for professional or personal reasons (especially in the case of Cuba, which was still a Spanish colony up to 1898). Numbers are available from the bibliography that was created for this dissertation. Of the Argentine works, 90 % of the editions appeared in Argentina, 6 % in Spain, 4 % in France, and 2 % in other countries. Of the Mexican works, 90 % of the editions appeared in Mexico, 9 % in Spain, 4 % in France, and 5 % in other countries. Of the Cuban works, 60 % of the editions appeared in Cuba, 28 % in Spain, 3 % in the USA, and 8 % in other countries. The sums are not exactly at 100 % because the numbers were rounded and also because some publishing houses had branches in several countries. See also chapter 4.1 on metadata analysis for details about the number of works and editions.
172. The unitarians advocated for a centralized government favoring Buenos Aires, while the federalists wanted a federation of autonomous provinces.
173. The most prominent case is Gertrudis Gómez de Avellaneda. She was born in Puerto Príncipe in Cuba in 1814 and died in Madrid in 1873. She lived both in Cuba and in Spain, where she remained after 1840. Her novels were published partly in Cuba and partly in Spain, and the settings and topics of her works also cover American as well as European spheres (Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba 1999, sec. Gómez de Avellaneda, GertrudisInstituto de Literatura y Lingüística de la Academia de Ciencias de Cuba. 1999. Diccionario de la literatura cubana (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmckh0j1.; Remos y Rubio 1945, 148–157Remos y Rubio, Juan J. 1945. Historia de la literatura cubana. Vol. 2: Romanticismo. La Habana: Cárdenas y Compañía.). Gómez de Avellaneda is mentioned in Spanish literary-historical works (see, for instance, Neuschäfer 2001, 269Neuschäfer, Hans-Jörg, ed. 2001. Spanische Literaturgeschichte. 2nd ed. Stuttgart/Weimar: J.B. Metzler.; Wolfzettel 1999, 48Wolfzettel, Friedrich. 1999. Der spanische Roman von der Aufklärung bis zur frühen Moderne. Tübingen/Basel: Francke.) but is a more prominent figure in Cuban literary histories, especially because of the significance of her novel “Sab” (1841) with a Cuban theme (Mitjans [1918] 2010, 355–367Mitjans, Aurelio. (1918) 2010. Historia de la Literatura cubana (en formato HTML). Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmc1g0w8.; Remos y Rubio 1945, 148–152 and 227–243Remos y Rubio, Juan J. 1945. Historia de la literatura cubana. Vol. 2: Romanticismo. La Habana: Cárdenas y Compañía.). There are many cases of authors who were either born or died in Cuba or Spain, changed their residence from the colony to the mother country or vice versa, and unfolded their literary activities in one or both places.
How such cases are treated regarding the bibliography and corpus created here is explained further below.
174. In a study on the Cuban novel and nation, Ferrer explains that the process of developing a Cuban national consciousness before the country’s political independence is unquestioned and that it can just be debated how early this awareness matured. He maintains that already the first Cuban novelistic production between 1837 and 1846 played a major role in the consolidation of a coherent image of a Cuban nation (Ferrer 2018, 11–19Ferrer, José Luis. 2018. La invención de Cuba: Novela y nación (1837–1846). Madrid: Editorial Verbum.). Sáinz de Medrano also sees a connection between the nineteenth-century Cuban novels and the development of a national consciousness: “Cuba conoció en esa centuria [el siglo XIX] un extraordinario desarrollo del relato en prosa, que parecía querer compensar la anterior penura literaria. Coincide este auge con movimientos sociales y políticos de notable intensidad, determinados en gran parte por la crisis abierta en torno a las relaciones de dependencia con España [...] y la sedimentación de una conciencia nacional. Un factor de marcada incidencia en este contexto será el problema de la esclavitud, que dará lugar a todo un ciclo de novelas” (Sáinz de Medrano 1987, 145Sáinz de Medrano, Luis. 1987. “Cirilo Villaverde.” In Historia de la Literatura Hispanoamericana, edited by Luis Íñigo Madrigal, 145–153. Vol. 2: Del neoclasicismo al modernismo. Madrid: Ediciones Cátedra.).
175. Especially the meetings in the house of Domingo Delmonte (1804–1853) (Suárez-Murias 1963, 20–21Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.).
176. On the system of slavery and the economic conditions in the nineteenth century, see Zeuske (2002, 69–89Zeuske, Michael. 2002. Kleine Geschichte Kubas. 2nd ed. München: C. H. Beck.). On the novel of customs and the antislavery novel in Cuba, see Rivas (1990Rivas, Mercedes. 1990. Literatura y esclavitud en la novela cubana del siglo XIX. Sevilla: Escuela de Estudios Hispano-Americanos.) and Suárez-Murias (1963, 23–24 and 25–40Suárez-Murias, Marguerite C. 1963. La novela romántica en Hispanoamérica. New York: Hispanic Institute in the United States.).
177. A similar strategy is followed by Molina (2011, 395Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.), who includes works written by Argentine authors or published in Argentina. Lichtblau considers nationality, residence, and cultural identification but does not explicitly include all novels published in the country: “The problem of identifying those works that clearly belong in the classification ‘novela argentina’ beset me at every stage in the preparation of this bibliography. But I have attempted, within a certain necessary arbitrariness inherent in all literary categorization, to be consistent in the selection or omission of the works cited. As used in this bibliography, an Argentine novel is understood to be any novel written by an Argentine or by a person residing in Argentina and culturally identified with that country” (Lichtblau 1997, xvLichtblau, Myron. 1997. The Argentine novel: an annotated bibliography. Lanham, Maryland: Scarecrow.). In other monographs and bibliographies, the question is not treated explicitly, e.g., in Fernández-Arias Campoamor (1952Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.) or Torres-Rioseco (1933Torres-Rioseco, Arturo. 1933. Bibliografía de la novela mejicana. Cambridge, Massachusetts: Harvard University Press.). In the “Diccionario de la literatura cubana”, the inclusion of an author is explained in each unclear case (Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba 1999Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba. 1999. Diccionario de la literatura cubana (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmckh0j1.).
178. Some Cuban authors had to leave the country because they openly opposed the colonial regime, for example, José Martí, who was a leading figure in the struggle for political and cultural independence of the country. Many Argentine writers also left the country during the time of the Rosas regime, for example, José Mármol, who published the first part of his novel “Amalia” (1855, AR) in Montevideo (Lichtblau 1959, 43Lichtblau, Myron I. 1959. The Argentine Novel in the Nineteenth Century. New York: Hispanic Institute in the United States.; Rössner 2007, 207–208Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.).
179. For example, the Argentine Carlos María Ocantos, who worked in Spain as a diplomat, or the Cuban Gertrudis Gómez de Avellaneda, who moved to Spain with her family as a young woman (Cárrega 1986, 27–30Cárrega, Hemilce. 1986. Las novelas argentinas de Carlos María Ocantos. Buenos Aires: Febra Editores.; Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba 1999Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba. 1999. Diccionario de la literatura cubana (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmckh0j1.).
180. For example, the novel “Pablo ou la vie dans les pampas” (1869, AR) by Eduarda Mansilla, which was published in Spanish as “Pablo o el hombre de las pampas” one year later, is excluded. Chapter 3.2.2 on the data model and text encoding explains how the authors’ places of birth and death, their nationalities, the places of publication of the novels’ editions, and the assignment of works to a country are encoded in the bibliography.
181. Argentina’s independence was declared officially on July 9, 1816, after the overthrow of the River Plate viceroyalty in 1810 (Lichtblau 1959, 15Lichtblau, Myron I. 1959. The Argentine Novel in the Nineteenth Century. New York: Hispanic Institute in the United States.). Mexico became independent on February 24, 1821, when the Catholic Church and the creoles opted for a constitutional monarchy (Rössner 2007, 137Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.). On the first cultural expressions of an emerging awareness of their own country in Cuba, see Ferrer (2018, 225–281Ferrer, José Luis. 2018. La invención de Cuba: Novela y nación (1837–1846). Madrid: Editorial Verbum.) and Rössner (2007, 152–153Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.). In Mexico, Fernández de Lizardi published several works before 1830, especially the novel “El Periquillo Sarniento” (1816, MX). However, these are not considered here because of their exceptional status. They are often described as forerunners of the nineteenth-century novel (Alegría 1959, 18–26Alegría, Fernando. 1959. Breve historia de la novela hispanoamericana. México: Ed. de Andrea.; Janik 2008, 34–36Janik, Dieter. 2008. Hispanoamerikanische Literaturen. Von der Unabhängigkeit bis zu den Avantgarden (1810–1930). Tübingen: Narr Francke Attempto.; Sánchez 1953, 111, 115–123Sánchez, Luis Alberto. 1953. Proceso y contenido de la novela hispano-americana. Madrid: Editorial Gredos.).
182. The year 1910 is also chosen by Anderson Imbert (1954Anderson Imbert, Enrique. 1954. Historia de la literatura hispanoamericana. México: Fondo de Cultura Económica.). Others set the limit slightly earlier, at the turn of the century, or later, e.g., in 1920 (Alegría 1959Alegría, Fernando. 1959. Breve historia de la novela hispanoamericana. México: Ed. de Andrea.; Ertler 2002Ertler, Klaus-Dieter. 2002. Kleine Geschichte des lateinamerikanischen Romans: Strömungen – Autoren – Werke. Tübingen: Gunter Narr.; Rössner 2007Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.).
183. For more detailed overviews of the political, economic, and social history, see Bernecker (1992, vol. 2: Lateinamerika von 1760 bis 1900Bernecker, Walter L., ed. 1992. Handbuch der Geschichte Lateinamerikas. Vol. 2: Lateinamerika von 1760 bis 1900. Stuttgart: Klett-Cotta.).
184. The numbers of novels published in the three countries were calculated based on the bibliography described in chapter 3.2. How the kinds of sources of the bibliography might have influenced the numbers is discussed in chapter 3.2.1. For each novel, the decade of the first known edition was determined. The following numbers resulted: Argentina: 1830–1840 (2 novels), 1841–1850 (2), 1851–1860 (25), 1861–1870 (17), 1871–1880 (23), 1881–1890 (87), 1891–1900 (75), 1901–1910 (72); Mexico: 1830–1840 (3), 1841–1850 (5), 1851–1860 (9), 1861–1870 (55), 1871–1880 (64), 1881–1890 (83), 1891–1900 (74), 1901–1910 (102); Cuba: 1830–1840 (8), 1841–1850 (19), 1851–1860 (31), 1861–1870 (12), 1871–1880 (10), 1881–1890 (15), 1891–1900 (22), 1901–1910 (14). Charts displaying these numbers are presented in chapter 4.1.3.
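The per-decade figures in this note can be cross-checked with a few lines of Python. The following is only a minimal sketch under stated assumptions, not the script used to build the bibliography: the counts are copied from the note, while the names `BINS`, `COUNTS`, and the helper `decade_bin` are illustrative inventions. The helper merely mirrors the bins as they are listed, where the first bin spans 1830 to 1840 and all later bins run from a year ending in 1 to a year ending in 0.

```python
# Per-decade counts of first editions, copied from the figures in the note.
BINS = ["1830-1840", "1841-1850", "1851-1860", "1861-1870",
        "1871-1880", "1881-1890", "1891-1900", "1901-1910"]

COUNTS = {
    "Argentina": [2, 2, 25, 17, 23, 87, 75, 72],
    "Mexico": [3, 5, 9, 55, 64, 83, 74, 102],
    "Cuba": [8, 19, 31, 12, 10, 15, 22, 14],
}


def decade_bin(year: int) -> str:
    """Hypothetical helper: map a first-edition year to one of the bins above."""
    if not 1830 <= year <= 1910:
        raise ValueError(f"year {year} lies outside 1830-1910")
    if year <= 1840:
        return BINS[0]  # the first bin also covers the year 1830
    start = (year - 1) // 10 * 10 + 1  # e.g. 1855 -> 1851
    return f"{start}-{start + 9}"


# Sum the decade counts per country and across all three countries.
totals = {country: sum(values) for country, values in COUNTS.items()}
grand_total = sum(totals.values())

for country, total in totals.items():
    print(f"{country}: {total}")
print(f"All three countries: {grand_total}")
```

The country totals come to 303 (Argentina), 395 (Mexico), and 131 (Cuba). Their sum of 829 agrees with the number of novels reported for the bibliography as a whole.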
185. Works are considered clearly unfinished if it is obvious from the structure or content that parts of the work are missing, for example, several chapters. On the other hand, if a series of several novels was envisaged by an author but not finished, individual novels that form part of it are still included. See also chapter 3.1.1.5 above on questions of the unit of the “novel”.
186. The data about the number of editions per novel are taken from the bibliography created for this investigation. Of the 829 novels, 70 % had only one edition that was published between 1830 and 1910, 19 % had two editions, and 10 % had more than two editions. The percentages are rounded. See also chapter 4.1.4 for overviews of the data about editions contained in the bibliography and the corpus.
187. See chapter 3.3.1 for an overview of the types of editions used.
188. An earlier version of the bibliographic database was presented at the conference HDH2017 at the University of Málaga (Henny-Krahmer 2017Henny-Krahmer, Ulrike. 2017. “Bib-ACMé: Bibliografía digital de novelas argentinas, cubanas y mexicanas (1810–1930).” In III Congreso de la Sociedad Internacional Humanidades Digitales Hispánicas. Sociedades, políticas, saberes (Libro de resúmenes), edited by Nuria Rodríguez Ortega, 99–104. Málaga: Universidad de Málaga. https://web.archive.org/web/20200514082600/https://humanidadesdigitaleshispanicas.es/wp-content/uploads/2020/03/Actas-HDH2017.pdf.). The work on the bibliography is managed on the version control platform GitHub. The first version of BibACMé from 2017 can be accessed at https://github.com/cligs/bibacme/releases/tag/v1.0. Accessed October 30, 2019. Ongoing work is available at: https://github.com/cligs/bibacme. Accessed October 30, 2019. For an online publication of the database with background information, basic search functionality, and some synoptical charts, see Henny-Krahmer (2017–2021Henny-Krahmer, Ulrike, ed. 2017–2021. “Bib-ACMé. Bibliografía digital de novelas argentinas, cubanas y mexicanas (1830–1910).” Version 1.2. Zenodo. https://doi.org/10.5281/zenodo.4453491.).
189. For a discussion of the problem up to the year 2004, see Romanos de Tiratel (2004Romanos de Tiratel, Susana. 2004. “La bibliografía nacional Argentina: una deuda pendiente.” In Proceedings of the 70th IFLA General Conference and Council, 1–11. Buenos Aires. https://web.archive.org/web/20230603163155/https://archive.ifla.org/IV/ifla70/papers/046s_Tiratel.pdf.).
190. Because Cuba was a Spanish colony until 1898, there are many authors who were born in Cuba but moved to Spain or vice versa.
191. For an overview of the history of and the current bibliographic work in Mexico, see Escalona Rios (2006Escalona Rios, Lina. 2006. “El trabajo bibliográfico en México.” In Recursos bibliográficos y de información, edited by Hugo Alberto Figueroa Alcántara and César Augusto Ramírez Velázquez, 185–215. México: Universidad Nacional Autónoma de México. http://hdl.handle.net/10391/4727.).
192. It is important to note that the bibliography of Torres-Rioseco builds mainly on the earlier work by Iguiniz (1926Iguiniz, Juan Bautista. 1926. Bibliografía de novelistas mexicanos: ensayo biográfico, bibliográfico y crítico. México: Imprenta de la Secretaria de relaciones exteriores.).
193. See chapter 3.1.2 on the “Borders of Argentina, Cuba, and Mexico” above.
194. See the script at https://github.com/cligs/scripts-nh/blob/master/corpus/bibacme-sources.py. The resulting chart “sources shares” can be downloaded at https://github.com/cligs/data-nh/tree/master/corpus/bibliography-sources. Accessed January 27, 2020.
195. See chapter 3.1.2, which touches upon the historical backgrounds of the three countries.
196. In this context, “digital editions” does not necessarily refer to digital critical scholarly editions but also to digitized editions of all kinds (e.g., published in full text, HTML format, or as PDF or image files). Print editions of the novels were also checked, but not comprehensively, for reasons of time and cost. In part, novels could be obtained through the German interlibrary loan, especially from the “Ibero-Amerikanisches Institut” in Berlin, but many editions can only be consulted in American libraries.
197. This applied, for example, to the work “Doce episodios de la vida de Bernabé Loyola, escritos por él mismo y dedicados a sus queridos hijos” (1876, MX) by Bernabé Loyola, which is listed in Torres-Rioseco’s bibliography of the Mexican novel, but whose fictional status is unclear because the name of the author is the same as the name mentioned in the title.
198. Two works that are mentioned in the “Bibliografía de la novela mejicana” but which were excluded here are, for example, “Narraciones humorísticas y cuentos infantiles” (1885, MX) by Manuel Covarrubias y Acevedo because it is a collection of short stories written for children, and “Staurófila. Precioso cuento alegórico. Parábola en que se simboliza los amores de Jesucristo con el alma devota” (1903, MX) by María Nestora Téllez Rendón because it is probably not predominantly realistic.
199. Of course, a collection of short novels written and published by the same author can also be considered a literary work; an anthology of works by different authors compiled for a secondary publication would usually not be considered as such. Here, “individual literary work” refers to works in the smallest sense, i.e., individual novels.
200. In those cases, there are also exceptions, but they are relatively few in number. See chapter 3.1.1.4 above.
201. By nationality, or, in the case of Cuba, by birth or predominant place of activity. See chapter 3.1.2 above.
202. This was the case for the novel “El dios del siglo. Novela original de costumbres contemporáneas” (1848, ES) by Jacinto de Salas y Quiroga. In Torres-Rioseco’s bibliography it is included with an edition of 1853 published in Mexico, but the work was published first in 1848 in Madrid, and the author is Spanish.
203. See the script at https://github.com/cligs/scripts-nh/blob/master/corpus/bibacme-sources.py. The resulting chart “sources inclusion” can be downloaded at https://github.com/cligs/data-nh/tree/master/corpus/bibliography-sources. Accessed January 27, 2020.
204. See https://github.com/cligs/bibacme/blob/master/app/data/entries-sources.csv. Accessed January 27, 2020. For the DLC, only possible candidates are listed, but not all the bibliographic entries, because the source is not a bibliography of the Cuban novel but a general literature dictionary, so no other definition of the novel applies. For Lichtblau and Torres-Rioseco, which are general bibliographies of the novel, only entries referring to 1830–1910 and those with unclear publication dates are listed because they were candidates for Bib-ACMé. All other entries of works published before 1830 or after 1910 were disregarded from the beginning. In the bibliographies of Lichtblau and Torres-Rioseco, explicit or implicit definition criteria for the novel apply, which are not entirely congruent with the ones developed in this dissertation, so the table shows where decisions to include a work as a novel differ between the bibliographies and Bib-ACMé. The table also contains works that were not included in the three main sources but were added from other sources. The entries are made on the work level, not by publication or edition.
205. Especially archival, print, and hemerographic sources.
206. A further type of entity defined in FRBR (subjects of intellectual endeavors: concept, object, event, and place) is not relevant here.
207. In the case of a novel, the expression means “the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text” (International Federation of Library Associations and Institutions (IFLA) 2009, 19International Federation of Library Associations and Institutions (IFLA). 2009. Functional Requirements for Bibliographic Records. Final report. https://repository.ifla.org/handle/123456789/811.).
208. See chapters 3.2.3 and 3.3.4 below, where it is explained how the assignment of subgenre labels to the bibliographic entries and texts in the corpus was made.
209. This point could be discussed further. On the level of genres that are at least in part determined formally, the issue is quite clear, but on the level of (sub)genres that are mainly determined thematically and by content, it is more complicated. How much change is needed to alter the subgenre to which a novel belongs? Would it then be considered a new work? However, this discussion is beyond the scope of this dissertation because the evolution of individual works is not traced here.
210. For example, the versified versions of Eduardo Gutiérrez’s novels. See footnote 124 above.
211. “Un año en California” (1869, AR) and “Un viaje al país del oro” (1876, AR) by Juana Manuela Gorriti, for instance, are considered as one work here because only the title of the novel changed, not its content.
212. Some authors had a whole range of pseudonyms, for example, the Cuban author Teodoro Guerrero y Pallarés, who wrote under seven other names (Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba 1999, sec. ‘G’Instituto de Literatura y Lingüística de la Academia de Ciencias de Cuba. 1999. Diccionario de la literatura cubana (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmckh0j1.).
213. For an overview, see Burnard (2014Burnard, Lou. 2014. What is the text encoding initiative? How to add intelligent markup to digital resources. Encyclopédie numérique 3. Marseille: OpenEdition Press. https://doi.org/10.4000/books.oep.426.).
214. The data is accessible at https://github.com/cligs/bibacme/tree/master/app/data. Accessed November 7, 2019.
215. The description of the element in the TEI Guidelines (Text Encoding Initiative Consortium 2023hText Encoding Initiative Consortium. 2023h. “<nationality>.” In: TEI P5: Guidelines for Electronic Text Encoding and Interchange, 1455–1457. Version 4.6.0. Revision f18deffba. https://web.archive.org/web/20230423102337/https://tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.) allows interpreting it in this wide sense because it is said to contain an informal description of a person’s citizenship, and the type of nationality can be characterized further (by birth, naturalized, or self-assigned).
216. See the next chapter, 3.2.3, for details about how the subgenres were assigned to the entries in Bib-ACMé.
217. See chapter 3.1.2 on the “Borders of Argentina, Cuba, and Mexico” above, where this decision is explained.
218. For details about how bibliographic references are encoded in TEI, see Text Encoding Initiative Consortium (2023bText Encoding Initiative Consortium. 2023b. “Components of Bibliographic References.” In: TEI P5: Guidelines for Electronic Text Encoding and Interchange, 146–164. Version 4.6.0. Revision f18deffba. https://web.archive.org/web/20230423102337/https://tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.).
219. RELAX NG is a schema language for XML. It can either be expressed as an XML document or in a compact, non-XML syntax. See Murata (2014Murata, Makoto. 2014. “RELAX NG home page.” https://web.archive.org/web/20230604105524/https://relaxng.org/.). For Bib-ACMé, the compact syntax is used because it gives a quick overview of the documents’ structure.
220. Schematron is a rule-based schema language using the query language XPath to validate XML documents. It allows rules to be defined that take the context of elements and attributes into account, so that very specific constraints can be formulated (Siegel 2022Siegel, Erik. 2022. Schematron. A language for Validating XML. Denver: XML Press.). All the schema files for Bib-ACMé are available at https://github.com/cligs/bibacme/tree/master/app/schemas (the RELAX NG files ending in “.rnc” and the Schematron files in “.sch”). Accessed November 8, 2019. Although the TEI offers the possibility to create a meta schema called ODD (“One document does it all”, Text Encoding Initiative Consortium n.d.bText Encoding Initiative Consortium. n.d.b “Getting Started with P5 ODDs.” https://web.archive.org/web/20230423104437/https://tei-c.org/favicon.ico.) from which schemas in different syntaxes can be derived, it was decided not to use one or several ODDs for Bib-ACMé. An ODD is very useful for general TEI models created for mixed content scenarios, where the same element can be used in different contexts and with a variety of attributes, but it is quite complex to narrow an ODD down to a strict data model for highly structured data. In the case of Bib-ACMé, the additional benefits of an ODD would not have justified this effort.
221. The usage of the label “novela original” in the Spanish context is described by Botrel as follows: “Las normas/formas tipográficas bibliográficas permiten también observar cómo después de un período en el que se precisa el origen de la novela («novela escrita en francés por Mr.» o «Madama...», «en inglés por Mistress...» o «Sr...» y «traducida al castellano por...» iniciales) al preeminencia del título unida con la hispanización casi sistemática de los nombres de los autores traducidos (Javier de Montepín, Pablo Feval, etc. y la importancia numérica de las traducciones, con la desaparición de la mención del traductor, al menos en las referencias bibliográficas, hace que el género novela venga disociado de una por otra parte deseada hispanidad y asociado con una patronímica y toponimia extranjerizante, como producto extranjero o, más probablemente, asimilado. La mención «novela original» o «española» introducirá durante cierto tiempo una distinción poco decisiva, estadísticamente al menos” (Botrel 2001, paras 12, 13Botrel, Jean-François. 2001. “La novela, género editorial (España, 1830–1930).” In La novela en España en los siglos XIX y XX. Historia, sociedad, búsqueda identitaria, edited by Paul Aubert, 35–51. Madrid: Casa de Velázquez. https://web.archive.org/web/20230606180953/https://books.openedition.org/cvz/2631.).
222. Of 829 works in the bibliography, only 403 carry the explicit label “novela”.
223. In romantic novels, it is common that issues and people are portrayed in black and white (e.g., the good and the bad), exaggerated, or dramatized, and the word “diablo” points in that direction.
224. In the case at hand, “novela sentimental” refers primarily to the content and plot of the novel, while “novela romántica” points to the literary current. Nevertheless, there are overlaps between both terms.
225. The counts are based on the final bibliography. Novels carrying several different labels were counted twice, and the percentages were rounded. See also chapter 4.1.5 for an overview of the subgenres in the bibliography and the corpus. The normalized frequencies are higher because some novels carry explicit labels, e.g., “histórico” or “costumbres”, but not the whole label involving the word “novela”, i.e., “novela histórica” or “novela de costumbres”, or they contain additional elements. For example: “novela de carácter histórico”, “episodio histórico”, “leyenda histórica”, “historia novelada”; “novela original de costumbres”, “cuadro de costumbres”, “boceto de costumbres”, “ensayo de costumbres”, etc. All the subgenre labels that occur at least ten times in the normalized version are included. In addition, the number of novels without any explicit label is also given. The script used to determine the subgenre label counts given in this and the following tables in this section is available at https://github.com/cligs/scripts-nh/blob/master/corpus/frequencies-subgenre-labels.xsl and the full table of frequencies can be viewed at https://github.com/cligs/data-nh/blob/master/corpus/bibliography-subgenre-labels/frequencies-explicit-labels.csv. Accessed March 7, 2020.
226. Strictly speaking, this is no bias, though, because the novelistic production of the time is as it is: some authors wrote more novels than others, and some wrote whole series of novels of a certain subgenre.
227. See https://github.com/cligs/data-nh/blob/master/corpus/bibliography-subgenre-labels/frequencies-explicit-thematic-labels.csv for a full table of frequencies of explicit thematic labels in the bibliography. Accessed March 7, 2020.
228. The table shows the top ten subgenre assignments and the number of novels without any assignment. See https://github.com/cligs/data-nh/blob/master/corpus/bibliography-subgenre-labels/frequencies-subgenre-labels.csv for the full table. Accessed March 7, 2020.
229. This is usually only outlined explicitly in specific and comprehensive studies of certain subgenres, e.g., in Rivas (1990Rivas, Mercedes. 1990. Literatura y esclavitud en la novela cubana del siglo XIX. Sevilla: Escuela de Estudios Hispano-Americanos.) for the anti-slavery novel or Schlickers (2003Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.) for the naturalist novel. In contrast, in overview works such as general literary histories, the criteria for the assignment of novels to subgenres are normally not explained.
230. See chapter 2.3 for a presentation of the subgenres related to themes and literary currents.
231. Which subgenre labels are grouped in which subgenre category here is explained in detail below (see table 10).
232. Which subgenre labels are related to others is explained below in table 10.
233. The table shows the top ten plus the number of entries in the bibliography without this type of subgenre label. Full tables are available at https://github.com/cligs/data-nh/blob/master/corpus/bibliography-subgenre-labels/frequencies-thematic-labels.csv and https://github.com/cligs/data-nh/blob/master/corpus/bibliography-subgenre-labels/frequencies-labels-currents.csv. Accessed March 8, 2020.
234. See chapter 3.3.4 about the assignment of subgenre labels for the texts in the corpus.
235. Examples of romantic novelas de costumbres are “Ironías de la vida” (1851, MX) and “La hora de Dios” (1865, MX) by Pantaleón Tovar and the series of “novelas de costumbres” written by José Tomás de Cuéllar (Fernández-Arias Campoamor 1952, 63–65Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.). Novels of this type that have been characterized as realist are “La familia Quillango” (1880, AR) by José María Cantilo and “Antón Pérez” (1903, MX) by Manuel Sánchez Mármol (Fernández-Arias Campoamor 1952, 86Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.; Gálvez 1990, 126–127Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.). Naturalistic novels carrying the label “costumbres” in their title are “Quimera. Boceto de costumbres” (1899, AR) by José Luis Cantilo and “Fruto vedado (Costumbres argentinas)” (1884, AR) by Paul Groussac. On Costumbrismo as a longer lasting phenomenon, Fernández-Arias Campoamor writes: “Los novelistas románticos que fueron costumbristas constituyen el puente tendido entre el romanticismo y el realismo. Costumbrismo cultivado ocasionalmente, en realidad, lo hubo siempre en todas las literaturas [...] Pero el costumbrismo como inclinación extensa y generalizada se inicia en el romanticismo” (Fernández-Arias Campoamor 1952, 56Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.). Kohut, too, points to the significance of Costumbrismo for several other literary currents: “Die Abgrenzung zwischen Romantik, Realismus und Naturalismus gestaltet sich schwierig. [...]
Die Problematik wird durch den sogenannten Costumbrismo zusätzlich kompliziert, der wie in Spanien zwischen Romantik und Realismus steht. Zum Realismus gehört die Zuwendung zur Gesellschaft, zur Romantik die häufig idyllisierende Perspektive. [...] Wichtiger als der Costumbrismo als eigenständige literarische Richtung ist die entsprechende Einfärbung zahlreicher realistischer bzw. Naturalistischer Romane. So gab der Chilene Alberto Blest Gana seinem Roman Martín Rivas (1862) den Untertitel Novela de costumbres político-sociales, der Argentinier Lucio Vicente López seinem Roman La gran aldea (1884) den Untertitel Costumbres bonaerenses” (Kohut 2016, 196Kohut, Karl. 2016. Kurze Einführung in Theorie und Geschichte der lateinamerikanischen Literatur (1492–1920). Berlin: Lit Verlag.).
236. Alternative terms are novela del gaucho, novela urbana, novela indianista, novela antiesclavista, and novela policial. In some cases, the alternative formulations do not involve differences in the meaning of the terms (this is assumed for novela gauchesca and novela del gaucho and for novela de la ciudad and novela urbana). In other cases, there are slight differences in the meaning. Novela criminal, for example, is more general than novela policial, and was therefore preferred here. Novela abolicionista was preferred over novela antiesclavista because it is the term that was in use historically. The term novela indigenista was preferred over novela indianista because it is more neutral. The novela indianista refers to romantic works that re-evaluate the historical past before the conquest of the Americas (Meléndez 1961Meléndez, Concha. 1961. La novela indianista en Hispanoamerica (1832–1889). Rio Piedras: Universidad de Puerto Rico.; Rössner 2007, 144Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.).
237. An example of an assignment made in the running text is: “De este episodio tomó apuntes el joven subteniente [Heriberto Frías], dándole base para construir una novela histórica a la que dió el nombre de Tomochic” (Fernández-Arias Campoamor 1952, 83Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.). In Dill’s literary history of the Spanish-American novel, several subchapters are entitled with subgenre labels so that the texts mentioned in them can be attributed to these subgenres. The chapter “Der Roman der Romantik”, for example, is subdivided into “Der politische Roman”, “Der historische Roman”, “Der indianistische Roman”, “Der kubanische negristische Roman”, and “Der sentimentale Roman”. Works mentioned in the chapter “Der politische Roman” are, for example, “Amalia” (1855, AR) by José Mármol or “Clemencia” (1869, MX) and “El Zarco” (1901, MX) by Ignacio Manuel Altamirano (Dill 1999, 125–139Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.).
238. For example, Dill mentions Emilio Rabasa’s novels in the chapter on the naturalistic novel but designates them as anti-naturalistic (Dill 1999, 170Dill, Hans-Otto. 1999. Geschichte der lateinamerikanischen Literatur im Überblick. Stuttgart: Reclam.). In her work on the Spanish-American naturalistic novel, Schlickers dedicates her own subchapter to each of the novels that she included in her corpus. In these detailed discussions of the works, she reasons about how each of the novels is in accordance with the criteria that she set up for a novel to be naturalistic and, in some cases, comes to the conclusion that they are not, e.g., for the novel “León Zaldívar” (1888, AR) by Carlos María Ocantos: “Resulta que León Zaldívar no es una novela naturalista, sino una mezcla entre novela rosa/folletinesca y costumbrista; Lichtblau [...] califica la novela de ‘happy combination of romantic and realistic elements’. [...] A nivel de la expresión, la distancia respecto a la poética naturalista se marca por los frecuentes comentarios del narrador que marca su hic et nunc y coincide ideológicamente tanto con el autor implícito como con el protagonista idealizado [...]. A pesar de una escritura por lo general ‘realista’, no se concretan ni el tiempo de la historia [...], ni se citan nombres –por ejemplo de los políticos que se critican. Así, la novela gana en dimensión alegórica lo que pierde en valor referencial, facilitanto así la transmisión y recepción masiva de la intención de sentido: la crítica del materialismo, la reivindicación de la sincera práctica de la religión católica y la idealización de la mujer abnegada y sumisa, para terminar con una lección moralizante: [...]” (Schlickers 2003, 200Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.). 
In this way, Schlickers checks the novels that can be provisionally assigned to the naturalistic current because of their theme or certain generic signals against the stricter formal criteria that she set up for the subgenre.
239. This can be seen, for example, in the monograph “The Mexican historical Novel” authored by Read. In order to give a comprehensive overview of the Mexican nineteenth-century fiction that can be considered historical, he also includes cases that are only historical on certain levels of the narration, e.g.: “López Portillo y Rojas pronounced Altamirano’s Clemencia the best Mexican novel of its time. [...] The setting of the story is the region around Guadalajara in December, 1863, the year of the French occupation of Mexico. The historical material serves only as a frame for the actions of two officers in the army of the republic [...]. A conflict developed over the affections of a beautiful woman, with a none too pleasant result” (Read 1939, 164–166Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.) or “Martínez de Castro’s novel Eva, published in 1885, though not purely historical in nature, has enough of historical background to justify its inclusion in this study” (Read 1939, 256Read, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.).
240. See, for example, Rivas (1990, 121–154Rivas, Mercedes. 1990. Literatura y esclavitud en la novela cubana del siglo XIX. Sevilla: Escuela de Estudios Hispano-Americanos.) and Schlickers (2003, 27–46Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.).
241. An example of a study where several different subgenres of the novel are defined is Molina (2011Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.). It is not a comparative study in a strict sense because rather than comparing the different subgenres, her goal is to describe and systematize the whole novelistic production in Argentina between 1838 and 1872. However, Molina finds that all the novels can be classified into one or several of four main types: “novelas históricas”, “novelas políticas”, “novelas socializadoras”, and “novelas sentimentales” (246–386Molina, Hebe Beatriz. 2011. Como crecen los hongos. La novela argentina entre 1838 y 1872. Buenos Aires: Teseo.).
242. In the description of the level “Objektbereich”, he mentions the Old French terms “matière de Bretagne” and “matière de Rome” that designate legendary material concerned with the history of Brittany and Rome, but there is no separate level for identity-related (culturally specific or national) texts (Raible 1980, 343Raible, Wolfgang. 1980. “Was sind Gattungen? Eine Antwort aus semiotischer und textlinguistischer Sicht.” Poetica 12: 320–349.).
243. In the figure, the levels of medium and syntactic are connected because, in the context of novels, generic terms that refer to a medium are usually to be understood as indications of how the text is represented structurally, e.g., in the form of letters in epistolary novels. Many of the labels relating to medial aspects are more vague, for example, those originally referring to painting and drawing: “cuadros”, “bocetos”, “esbozos”, and “impresiones” describe figuratively how the text is represented.
244. Of course, there is research on the Cuban/Argentine/Mexican/Spanish-American novels, and also “national novels” (Sommer 1993Sommer, Doris. 1993. Foundational Fictions. The National Romances of Latin America. Berkeley: University of California Press.) are discussed, but these approaches are usually not related to explicit generic terms.
245. See also chapter 4.1.5, where a series of charts that display the distribution of subgenre labels is included.
246. The whole table is also available at https://github.com/cligs/data-nh/blob/master/corpus/bibliography-subgenre-labels/overview-subgenres.csv. Accessed March 28, 2020.
247. The creation and usage of (digital) corpora in linguistics is the subject of a whole subdiscipline, corpus linguistics. See Andresen and Zinsmeister (2019Andresen, Melanie, and Heike Zinsmeister. 2019. Korpuslinguistik. Tübingen: Narr Francke Attempto.) for a recent textbook on the topic. A comprehensive handbook including information about the history of the discipline, the compilation and different types of corpora, preprocessing, their use and exploitation, including statistical and computational methods, is Lüdeling and Kytö (2008Lüdeling, Anke, and Merja Kytö, eds. 2008. Corpus Linguistics: An International Handbook. 2 vols. Handbooks of Linguistics and Communication Science (HSK). Berlin, Boston: De Gruyter.). Some of the differences between linguistic and literary corpora are that the former also focus significantly on spoken language and that the principles of representativeness and sampling play a major role. Usually, the goal of linguistic corpora is to represent language (or a specific subsystem of language) as a whole in order to be able to get generalizable results when the corpora are analyzed. At the same time, except in extreme cases, it is not possible to record all the linguistic utterances that are relevant to a certain domain. Literary corpora, on the other hand, put their emphasis on written texts. In the literary domain, it is also often not possible to build a corpus comprising the whole population of textual production (because sources are lost or unavailable for other reasons). However, the gap between a corpus and the target domain is usually smaller. It is, for example, possible to compile the complete works of an author. As a consequence, regarding the creation and exploitation of corpora, literary studies can build on many of the research findings achieved in corpus linguistics, but there is still a need to adapt them.
248. One reason for the close relationship between literary corpora and scholarly editing is that a literary work is an abstract notion: a work is not necessarily manifest in a single document. There may be subsequent versions of works that can be viewed and compared to get a critical version serving as a base text, or a decision can be made for a specific historical version that is prepared with scholarly editorial methods. For the history of editorial scholarship in the philologies and different types of literary scholarly editions, see, for example, Sahle (2013, 107–224Sahle, Patrick. 2013. Digitale Editionsformen. Zum Umgang mit der Überlieferung unter den Bedingungen des Medienwandels. Vol. 2: Befunde, Theorie und Methodik. Schriften des Instituts für Dokumentologie und Editorik 8. Norderstedt: BoD.). Principles for creating corpora for the study of literary genres are discussed in Hempfer (1973, 128–136Hempfer, Klaus W. 1973. Gattungstheorie. Information und Synthese. München: Fink.) and Zymner (2003, 122–139Zymner, Rüdiger. 2003. Gattungstheorie. Probleme und Positionen der Literaturwissenschaft. Paderborn: mentis.; 2010, 23–25Zymner, Rüdiger, ed. 2010. Handbuch Gattungstheorie. Stuttgart: J.B. Metzler.).
249. From a more general perspective, the creation of data collections (including collections of spoken language and text, but also of movies, pictures, musical pieces, or other artifacts) for digital analysis is examined by Schöch (2017aSchöch, Christof. 2017a. “Aufbau von Datensammlungen.” In Digital Humanities. Eine Einführung, edited by Fotis Jannidis, Hubertus Kohle, and Malte Rehbein, 223–233. Stuttgart: J.B. Metzler.). Criteria for reviewing digital text collections in general have been developed by the Institute for Documentology and Scholarly Editing (IDE) (Henny and Neuber 2017Henny, Ulrike, and Frederike Neuber. 2017. “Criteria for Reviewing Digital Text Collections, version 1.0.” Institut für Dokumentologie und Editorik. https://web.archive.org/web/20230418162046/https://www.i-d-e.de/publikationen/weitereschriften/criteria-text-collections-version-1-0/. ). Nevertheless, Gius, Krüger, and Sökefeld (2019, 165Gius, Evelyn, Katharina Krüger, and Carla Sökefeld. 2019. “Korpuserstellung als literaturwissenschaftliche Aufgabe.” In DHd2019. Digital Humanities: multimedial & multimodal. Konferenzabstracts, Universitäten zu Mainz und Frankfurt, 25. bis 29. März 2019, edited by Patrick Sahle, 164–166. Frankfurt & Mainz: Verband Digital Humanities im deutschsprachigen Raum e.V. https://doi.org/10.5281/zenodo.2596095. ) point out that the preparation of corpora specifically for literary studies is still hardly reflected upon and that this entails certain risks, especially in the case of larger, digital corpora (e.g., unwanted correlations in the data).
250. The relationship between the population of novels, as approximated with the bibliographical database, and the texts in the corpus is outlined in chapter 4.1 below.
251. A better state of digitization would also allow using specific editions of the novels, e.g., only first or last editions. For this dissertation, no specific strategy for selecting editions could be followed because the main issue was having access to at least one edition of a novel in digital format.
252. Statistical charts related to these data are available in the script and data repositories. They were created with the following script, which was also used to produce the figures that follow in this chapter: https://github.com/cligs/scripts-nh/blob/master/corpus/corpus-sources.py. The metadata about the sources of the novels is available at https://github.com/cligs/data-nh/blob/master/corpus/metadata_sources.csv. The resulting charts can be downloaded from the folder https://github.com/cligs/data-nh/tree/master/corpus/corpus-sources. Accessed January 29, 2023.
253. In this chapter, main sources and major sources mean sources that were especially important for the corpus at hand, whereas minor sources provided fewer relevant texts. These formulations do not characterize the sources in themselves. Wikisource, for example, is a minor source for this corpus but a major source for digital texts in general.
254. Links to the websites of the digital libraries, repositories, and institutions are given in table 46 (“Sources of the novels in the corpus”) in the appendix.
255. It can be inferred from the HTML tags that the underlying data format of the library is TEI, but unfortunately, the texts are currently not offered in that format to the public. Besides texts in HTML, this virtual library also contains texts downloadable as PDF files with images.
256. See chapter 3.3.2 for details about the text treatment. OCR stands for optical character recognition and is an umbrella term for procedures of the automatic conversion of images of text into machine-readable text.
257. Thanks to cooperation with the “Ibero-Amerikanisches Institut” and financial support from the project CLiGS, several novels were scanned by the library and added to the digital collections of the institute.
258. Links to digital versions of the novels’ editions are given in Bib-ACMé. See https://github.com/cligs/bibacme/blob/master/app/data/editions.xml. Accessed December 8, 2019. The collection of links is not exhaustive, though, also because the availability of digital texts and images changes over time.
259. The novels contained in these two repositories overlap to a great extent, probably because many of the scans were made by Google and uploaded to both. The “Internet Archive” is more permissive in that all the image files are downloadable also from outside the USA. The digitization work done by Google is impressive and of utmost importance for this dissertation.
260. There are, of course, also novels in the form of printed books, which were borrowed from university libraries, scanned, and converted to digital images.
261. How the quality control was done is outlined in the next chapter 3.3.2 on text treatment.
262. Digital libraries and portals connected to libraries, universities, and governmental institutions were counted as scholarly. Individual, personal websites were subsumed under the general category. The texts taken from “Wikimedia Commons” are included in the general group here, but ultimately they have a scholarly background, as well, because they are all digital reproductions of novels held by the “Academia Argentina de Letras” (Wikimedia Commons 2019Wikimedia Commons. 2019. “Category:Files from Academia Argentina de Letras.” Accessed March 16, 2020. https://commons.wikimedia.org/wiki/Category:Files_from_Academia_Argentina_de_Letras.). That the Argentine Academy decided to upload the PDF files of the novels (and other texts) to “Wikimedia Commons” for the benefit of the general public is exemplary.
263. Details about which sources have changed during the work on this dissertation can be found in table 46 in the appendix.
264. For details, see the next chapter 3.3.2 on text treatment.
265. Most of the modern editions are from scholarly sources because the books borrowed from the university libraries were almost exclusively modern editions. Very old editions usually cannot be borrowed in the German library system. In addition, digital repositories tend to offer older editions for which no copyright issues are to be expected.
266. See, for example, Paz ([1883] 2017Paz, Ireneo. (1883) 2017. Doña Marina. Berlin: Ibero-Amerikanisches Institut – Preußischer Kulturbesitz. http://resolver.iai.spk-berlin.de/IAI00006A0B00000000.).
267. Tests were also done with the free software Tesseract, but the results were not as satisfying as the ones achieved with ABBYY Finereader.
268. For an overview of the shares of the different types of source editions in the corpus, see the previous chapter 3.3.1 above.
269. See chapter 4.2.1 about the textual features used.
270. An example of a large-scale project aiming at the creation of digital editions suitable for historical linguistic analyses is the “German Text Archive” (“Deutsches Textarchiv”) (BBAW 2022Berlin-Brandenburgische Akademie der Wissenschaften, ed. 2022. “Deutsches Textarchiv. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache.” DTA. Accessed November 6, 2022. https://web.archive.org/web/20221106163539/https://www.deutschestextarchiv.de/.).
271. In “Project Gutenberg”, for example, all the text of a novel is presented on a single page. On the platforms “Wikisource” and “Biblioteca Virtual Miguel de Cervantes”, in contrast, a novel is presented on several pages (usually by chapter on “Wikisource” and apparently arbitrary divisions in the “Biblioteca Virtual Miguel de Cervantes”). As examples, see Cambaceres (2008Cambaceres, Eugenio. 2008. “Sin rumbo.” Wikisource. https://web.archive.org/web/20230422143111/https://es.wikisource.org/wiki/Sin_rumbo., [1885] 2000Cambaceres, Eugenio. (1885) 2000. Sin rumbo (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmc1v5d1.), and Ocantos ([1913] 2007Ocantos, Carlos María. (1913) 2007. “Quilito.” Project Gutenberg. https://web.archive.org/web/20230422145502/https://www.gutenberg.org/files/23035/23035-h/23035-h.htm.).
272. The script is available at https://github.com/cligs/scripts-nh/blob/master/corpus/text_treatment/clean.xsl. Accessed March 17, 2020.
273. For parts and chapters of the novels, it was checked whether all the units were there, for example, by ensuring successive chapter numbers. For the paragraphs, the check was done in a rough manner. For instance, it was tested whether there were paragraphs starting with a lower-case letter, which was often a sign that there was a superfluous paragraph boundary.
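The two checks described in note 273 can be illustrated with a minimal Python sketch. It is not the procedure actually used for the corpus: the function names `missing_chapter_numbers` and `suspicious_paragraphs` are hypothetical, and the sketch assumes that chapter numbers have already been extracted as integers and that paragraphs are available as plain strings.

```python
def missing_chapter_numbers(numbers):
    """Return chapter numbers missing from an otherwise successive sequence.

    Assumption: `numbers` holds the chapter numbers found in one part
    of a novel, already converted to integers.
    """
    if not numbers:
        return []
    expected = set(range(min(numbers), max(numbers) + 1))
    return sorted(expected - set(numbers))


def suspicious_paragraphs(paragraphs):
    """Return the indices of paragraphs whose first letter is lower-case.

    Leading punctuation that is common in Spanish prose (inverted marks,
    quotation marks, dashes) is skipped before the test. A hit is only a
    hint at a superfluous paragraph boundary, not proof of one.
    """
    flagged = []
    for index, paragraph in enumerate(paragraphs):
        stripped = paragraph.lstrip("¡¿«“\"'\u2014\u2013 \t")
        if stripped and stripped[0].islower():
            flagged.append(index)
    return flagged
```

Flagged paragraphs would still have to be inspected manually, which corresponds to the "rough manner" of the check mentioned above.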
274. Other historical paratexts such as tables of contents, lists of errata, or advertisements of other publications were dropped. See chapter 3.3.4 below about the assignment of subgenre labels to the novels in the corpus.
275. A bar chart analyzing the gaps was produced with the script https://github.com/cligs/scripts-nh/blob/master/corpus/text_treatment/check-gaps.xsl. The resulting data can be downloaded at https://github.com/cligs/data-nh/tree/master/corpus/text-treatment. Accessed January 29, 2023.
276. The number of missing items was estimated depending on the extent of the missing text on the page.
277. The script is available at https://github.com/cligs/scripts-nh/blob/master/corpus/text_treatment/spellchecking.py. It was also used to create the charts about spelling errors included below. Accessed March 17, 2020. The core function for the spell check and the functions for the visualization of errors were written by the author of this dissertation. An additional function in the module for the automatic correction of errors was written by Christof Schöch. The idea behind the spell check and results from a preliminary corpus of Spanish-American novels and from a corpus of French novels are presented in Henny-Krahmer and Schöch (2016Henny-Krahmer, Ulrike, and Christof Schöch. 2016. “How good are our texts, really? Quality assurance for literary texts from various sources.” CLiGS – Computergestützte literarische Gattungsstilistik. https://web.archive.org/web/20230422152455/https://cligs.hypotheses.org/371.).
278. As an aside, the results of the spell check can also be used for purposes other than controlling the orthography. The words that occur in the check files but are not real errors indicate how many and how many different proper names the novels contain, how many foreign words, how many words from special areas of vocabulary, and so on, so they are also interesting from a stylistic point of view.
279. The CSV file containing the results of the spell check is available at https://github.com/cligs/data-nh/blob/master/corpus/text-treatment/spellcheck.csv. Accessed March 22, 2020.
280. The size of the whole vocabulary is 197,520. This number was determined with the script https://github.com/cligs/scripts-nh/blob/master/features/bow.py. Accessed March 19, 2020.
281. The whole corpus contains 17,104,856 tokens (see the script in the previous footnote).
282. I.e., tokens that occur only once in the whole corpus.
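The three quantities mentioned in notes 280 to 282 (vocabulary size, number of tokens, and tokens occurring only once) can be related to one another in a short Python sketch. It is only an illustration: the tokenization with the regular expression `\w+` and the lowercasing are assumptions made here and need not match what the script “bow.py” does, and the function name `corpus_counts` is hypothetical.

```python
import re
from collections import Counter


def corpus_counts(texts):
    """Return (tokens, types, hapaxes) for an iterable of text strings.

    Assumptions: tokens are maximal runs of word characters, and all
    tokens are lowercased before counting.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    tokens = sum(counts.values())   # running words in the whole corpus
    types = len(counts)             # size of the vocabulary
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return tokens, types, hapaxes
```

With a different tokenizer, the absolute numbers would change, so figures obtained this way are comparable only within one and the same setup.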
283. These conclusions were already drawn in Henny-Krahmer and Schöch (2016Henny-Krahmer, Ulrike, and Christof Schöch. 2016. “How good are our texts, really? Quality assurance for literary texts from various sources.” CLiGS – Computergestützte literarische Gattungsstilistik. https://web.archive.org/web/20230422152455/https://cligs.hypotheses.org/371.).
284. For the spell check, all the tokens were converted to lowercase.
285. The web source for the lists of proper names and surnames was Olea (2021Olea, Ismael. 2021. “Lemarios y listas de palabras del español.” GitHub.com. https://web.archive.org/web/20230609200732/https://github.com/olea/lemarios.). From that source, the files “nombres-propios-es.txt” and “apellidos-es.txt” were used. The names of the countries were obtained from Wikipedia (2022Wikipedia. 2022. “Anexo:Nombres de países en español.” Wikipedia. https://web.archive.org/web/20230422153154/https://es.wikipedia.org/wiki/Anexo:Nombres_de_pa%C3%ADses_en_espa%C3%B1ol.). The list of capitals was retrieved from Frech (1998–2017Frech, Susana. 1998–2017. “Países, sus capitales y gentilicios* correspondientes Es < En.” Susana Frech. Traducciones Profesionales. https://web.archive.org/web/20210615063220/http://www.susana-translations.de/paises.htm.). To compare these lists to the spell check error list, the function generate_exception_list() in the script “spellchecking.py” was used. See footnote 277. The resulting exception lists (“exceptions-proper-names.txt”, “exceptions-surnames.txt”, “exceptions-countries.txt”, “exceptions-capitals.txt”) are contained in https://github.com/cligs/data-nh/tree/master/corpus/text-treatment/exception-words. Accessed March 28, 2020.
286. The numbers were calculated with the function interprete_exception_list() in the module “spellchecking.py”. In the table, the relative numbers were rounded.
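The logic behind notes 285 and 286 (intersecting the spell check error list with a reference list, then measuring how many errors the resulting exception list covers) can be sketched as follows. This is not the code of “spellchecking.py”: the functions `build_exception_list` and `coverage` are hypothetical stand-ins whose signatures are assumptions, and the rounding to two decimals is chosen here only for illustration.

```python
def build_exception_list(error_words, reference_words):
    """Return the flagged words that also occur in a reference list.

    Both sides are lowercased, assuming that the spell check itself
    worked on lowercased tokens.
    """
    reference = {word.strip().lower() for word in reference_words if word.strip()}
    return sorted({word.lower() for word in error_words} & reference)


def coverage(error_counts, exceptions):
    """Return (covered types, covered tokens, share of covered tokens).

    `error_counts` maps each flagged word to its number of occurrences.
    """
    exceptions = set(exceptions)
    covered_types = [word for word in error_counts if word in exceptions]
    covered_tokens = sum(error_counts[word] for word in covered_types)
    total_tokens = sum(error_counts.values())
    share = round(covered_tokens / total_tokens, 2) if total_tokens else 0.0
    return len(covered_types), covered_tokens, share
```

Counting both types and tokens matters because a single frequent proper name can account for a large share of the flagged tokens while adding only one entry to the exception list.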
287. See chapter 3.3.5, where the linguistic annotation of the corpus files is described.
288. A version of the above list of verb form patterns with disassembled expressions is available at https://github.com/cligs/data-nh/blob/master/corpus/text-treatment/exception-words/source-lists/verb-form-patterns-detail-es.txt. Accessed March 31, 2020. This list alone already comprises 442 different expressions. In reality, though, not all theoretically possible combinations necessarily occur in the language and in language use, so it would be even more work to create a list of verb forms with pronoun suffixes that is, on the one hand, complete and, on the other hand, adequate for the linguistic reality. The one created here is only an approximation of such a list.
289. The table is sorted by the number of error tokens covered. The files with the corresponding patterns are named “verb-form-patterns-es.txt”, “diminutive-patterns-es.txt”, “superlative-patterns-es.txt”, and “adverb-patterns-es.txt”. They are contained in https://github.com/cligs/data-nh/tree/master/corpus/text-treatment/exception-words/source-lists. To create the exception lists from the patterns, the function generate_exception_list() in the module “spellchecking.py” was used. The resulting exception lists can be viewed at https://github.com/cligs/data-nh/tree/master/corpus/text-treatment/exception-words. The function interprete_exception_list() in the same module served to evaluate how many errors are covered by each exception list. The links were accessed on March 31, 2020.
290. The table is sorted by the number of error tokens covered. The files with the corresponding exception lists are named “exceptions-proper-names_ext.txt”, “exceptions-surnames_ext.txt”, “exceptions-other.txt”, “exceptions-places.txt”, “exceptions-countries_ext.txt”, “exceptions-foreign.txt”, “exceptions-special.txt”, “exceptions-oral.txt”, and “exceptions-archaic.txt”. They are contained in https://github.com/cligs/data-nh/tree/master/corpus/text-treatment/exception-words. The function interprete_exception_list() in the module “spellchecking.py” served to evaluate how many errors are covered by each exception list. The link was accessed on March 31, 2020.
291. The manually edited list of places covers place names other than countries and capitals.
292. The list of errors that remained after including exception words is available at https://github.com/cligs/data-nh/blob/master/corpus/text-treatment/spellcheck_exc.csv. The charts were produced with the module “spellchecking.py” and are available as HTML files at https://github.com/cligs/data-nh/tree/master/corpus/text-treatment (“coverage-exception-lists.html”, “distribution-spelling-errors-exc.html”, “distribution-spelling-errors-files.html”, “distribution-spelling-errors-files-relative.html”, “distribution-spelling-errors-files-editiontype.html”, “distribution-spelling-errors-files-filetype.html”, “distribution-spelling-errors-files-institution.html”). Accessed April 4, 2020.
293. For all of these figures, values relative to text length were used.
294. See the spell-check result file for this novel at https://github.com/cligs/data-nh/blob/master/corpus/text-treatment/spellcheck_nh0215.csv. Accessed April 5, 2020.
295. See, for example, the “Corpus of German-Language Fiction”, consisting of almost 3,000 prose works in plain text format or the corpora prepared by the Computational Stylistics Group (Fischer and Strötgen 2017Fischer, Frank, and Jannik Strötgen. 2017. “Corpus of German-Language Fiction (txt).” figshare. https://doi.org/10.6084/m9.figshare.4524680.v1., Computational Stylistics Group 2023Computational Stylistics Group. 2023. “Resources.” https://web.archive.org/web/20230423092924/https://computationalstylistics.github.io/resources/.).
296. The overall approach that was used in the CLiGS project to encode literary corpora in TEI is described in Calvo Tello, Henny-Krahmer, and Schöch (2018Calvo Tello, José, Ulrike Henny-Krahmer, and Christof Schöch. 2018. “Textbox. Análisis del léxico mediante corpus literarios.” In Historia del léxico español y Humanidades Digitales, edited by Dolores Corbella, Alejandro Fajardo, and Jutta Langenbacher, 225–253. Berlin: Peter Lang.) and Schöch et al. (2019Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp. 2019. “The CLiGS Textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in TEI XML.” Journal of the Text Encoding Initiative. Rolling Issue. https://doi.org/10.4000/jtei.2085.).
297. Obviously, for the corpus at hand, three digits would have been enough, but it was decided to use four because other corpora in the CLiGS project are more extensive. In addition, future extensions of the corpora should be possible.
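The identifier scheme mentioned in note 297 amounts to zero-padding a running number to four digits. The following Python sketch shows the idea; the prefix “nh” is inferred from the file name “spellcheck_nh0215.csv” cited in note 294, and the function name `make_identifier` is hypothetical.

```python
def make_identifier(number, prefix="nh", width=4):
    """Build a corpus identifier such as 'nh0215' from a running number.

    Assumptions: identifiers consist of a fixed prefix followed by a
    zero-padded number, and numbering starts at 1.
    """
    if not 0 < number < 10 ** width:
        raise ValueError(f"number must be between 1 and {10 ** width - 1}")
    return f"{prefix}{number:0{width}d}"
```

With a width of four, up to 9,999 texts can be numbered without changing the length of the identifiers, which keeps them sortable as plain strings if the corpus is extended later.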
298. For general documentation of the TEI header, see the corresponding chapter in the TEI guidelines (Text Encoding Initiative Consortium 2023cText Encoding Initiative Consortium. 2023c. “The TEI Header and Its Components.” In: TEI P5: Guidelines for Electronic Text Encoding and Interchange, 22–23. Version 4.6.0. Revision f18deffba. https://web.archive.org/web/20230423102337/https://tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.).
299. An example of a series title is “Dramas militares”, because there are several novels by the Argentine writer Eduardo Gutiérrez associated with this label, e.g., “El Chacho” (1884, AR) and its sequels. An alternative title means that the novel has been published under different titles. Sometimes, the title of a novel changes from the first edition to subsequent ones. For example, the novel “Amar al vuelo” (1884, AR) by Enrique E. Rivarola was first published with the title “El arma de Werther”. Furthermore, different editions of the novels often have different subtitles, but these are encoded as several titles of the type <title type="sub"> and not as alternative main titles. The tag <title type="part"> is only used for one special case in the corpus: the novel “Pepa Larrica” (1884, AR) by Rafael Barreda was interpreted as one work consisting of three parts that were published separately with their own title (“Las dos tragedias”, “La confesión de un médico”, and “Religión o muerte”). See chapter 3.1.1.5 above, where this decision is explained. Where novels are published under one main title but with several parts (“Parte primera: ...”, “Parte segunda:...”, etc.), the titles of the parts are only encoded as headings in the main body of the text and not in the title statement of the TEI header.
300. The VIAF entry of Álvaro de la Iglesia is available at OCLC (2010–2021bOCLC. 2010–2021b. “VIAF. Virtual International Authority File.” https://web.archive.org/web/20230423111630/https://viaf.org/.).
301. See chapter 3.3.5 below for further information about the publication of the corpus.
302. Figures 24 to 27 were produced with the following Python module: https://github.com/cligs/scripts-nh/blob/master/corpus/metadata_encoding/corpus_copyright.py. The resulting charts “authors-death-years.html”, “first-publication-years.html”, “base-publication-years.html”, and “copyright-status.html” can be downloaded at https://github.com/cligs/data-nh/tree/master/corpus/metadata-encoding. Accessed April 6, 2020.
303. Until then, the TEI file of this novel is kept in a private repository that is part of the GitHub space of the CLiGS project: https://github.com/cligs/novelashispanoamericanas. Accessed March 20, 2020. The novel in question is “La gloria de Don Ramiro” (1908, AR) by Enrique Larreta.
304. In the law, the rule is formulated as applying to anonymous and pseudonymous works for which the author’s name cannot be verified. Here, the names of the authors are known, but their dates of death are not, so the regular law cannot be applied (Bundesamt für Justiz n.d.bBundesamt für Justiz. n.d.b “Gesetz über Urheberrecht und verwandte Schutzrechte (Urheberrechtsgesetz). § 66 Anonyme und pseudonyme Werke.” Gesetze im Internet. https://web.archive.org/web/20230423112720/https://www.gesetze-im-internet.de/urhg/__66.html.).
305. The authors concerned are Ventura Aguilar (?–?, AR), C. M. Blanco (?–?, AR), Rodolfo Díaz Olazábal (?–?, AR), Silverio Domínguez (1852–?, AR), José Rafael Guadalajara (1863–?, MX), Ramón Machali (?–?, AR), Vicente Morales (?–?, MX), Pedro G. Morante (?–?, AR), Margarita Rufina Ochagavia (1840–?, AR), Andrés Portillo (?–?, MX), Pedro Robles (?–?, MX), Mercedes Rosas de Rivera (?–?, AR), and Victorio Sylva (?–?, AR). Some of the dates could possibly be found out with more rigorous historical research, but checking VIAF, the bibliographies mentioning the works of these authors as well as general searches on the web, did not lead to any results.
306. The difference between the latter two is not always easily identified. Besides the characteristics of the digital edition itself, it was also taken into account here if copyright is claimed by the editors of the digital edition or not. This is discussed in more detail below for the novels in question.
307. The works in question are the following (the dates in parentheses indicate the year of the first edition/year of the digital or print edition used/year of the expiration of the protection): “Astucia” (1866/2005/2030, MX) by Luis Gonzaga Inclán, “Dos partidos en lucha” (1875/2005/2030, AR) and “El tipo más original” (1879/2001/2026, AR) by Eduardo Ladislao Holmberg, “El espejo de Amarilis” (1902/2011/2036, MX) by Laura Méndez de Cuenca, “Antón Pérez” (1903/2011/2036), “Juanita Sousa” (1890/2011/2036, MX), “Pocahontas” (1882/2011/2036) and “Previvida” (1906/2011/2036, MX) by Manuel Sánchez Mármol, “Clemencia” (1877/2012/2037, AR) and “La huella del crimen” (1877/2009/2034, AR) by Luis Vicente Varela, “María de Montiel” (1861/2010/2035, AR) by Mercedes Rosas de Rivera, and “Stella” (1905/2011/2036, AR) by Emma de la Barra. In principle, it would be possible to examine in detail to what extent the used print editions are scholarly editions that differ significantly from previous editions. However, for the sake of simplicity and to avoid legal ambiguities, all the affected TEI files are kept unpublished until the ancillary copyright expires.
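The expiry years listed in note 307 follow a simple arithmetic pattern: in every triple, the third year equals the year of the edition used plus 25. A minimal Python sketch of this calculation is given below. The term of 25 years is inferred here from the figures in the note and is not quoted from the statute, and the names `PROTECTION_YEARS` and `protection_expiry` are hypothetical.

```python
# Assumption: the protection term is read off the triples in the note
# (e.g., 2005 -> 2030), not taken from the legal text itself.
PROTECTION_YEARS = 25


def protection_expiry(edition_year):
    """Return the year in which the ancillary copyright of an edition expires."""
    return edition_year + PROTECTION_YEARS


# Worked examples using edition years mentioned in the note.
examples = {year: protection_expiry(year) for year in (2001, 2005, 2011)}
```

For instance, the edition of “Astucia” used here dates from 2005, so the calculation yields 2030, which matches the value given in the note.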
308. This applies to 22 novels obtained from the following sources: “Wikisource” (6 novels), “Biblioteca Digital Argentina” (4), “Biblioteca Virtual Antorcha” (4), “El Libro Total” (3), “Project Gutenberg” (2), “Autores de Concordia” (1), “EnCaribe” (1), “Individual website” (1).
309. The following novels are concerned: “Antonia” (1872, MX) by Ignacio Manuel Altamirano, “Confesiones de un pianista” (1873, MX) by Justo Sierra Méndez, “Historia vulgar” (1904, MX) by Rafael Delgado, “Los fuereños” (1883, MX) by José Tomás de Cuéllar, “Los maduros” (1882, MX) by Pedro Castera, and “¡Vendía cerrillos!” (1889, MX) by Federico Gamboa. Four were edited in 2009, one in 2010, and one in 2018, so the ancillary copyright will cease in 2034, 2035, and 2043, respectively. In the virtual library, five of these novels are contained in the collection “Novelas en tránsito – Primera serie”. Unfortunately, these digital editions are not retrievable anymore. Another related collection containing the sixth novel is still accessible: “Novelas en tránsito – Segunda serie”. There, it can be seen that the editions of the portal are prepared according to scholarly standards and that copyright is claimed by the “Universidad Nacional Autónoma de México” in the PDF versions of the editions (see, for instance, Sierra 2018Sierra, Justo. 2018. Confesiones de un pianista. Edited by Karla Ximena Salinas Gallegos. La novela corta. Una biblioteca virtual. Novelas en tránsito. Segunda serie. México: Universidad Nacional Autónoma de México. https://web.archive.org/web/20200322100041/http://www.lanovelacorta.com/novelas-en-transito-2/confesiones-de-un-pianista.pdf.).
310. A tabular overview of the corpus metadata that is relevant for copyright questions is available at https://github.com/cligs/data-nh/blob/master/corpus/metadata_copyright.csv. Accessed April 6, 2020.
311. The external list is a file named “bibliography.xml”. In it, the bibliographic references cited for abstracts and subgenre assignments throughout the corpus are collected. The filename is not used as part of the pointer in order to keep the references short. See https://github.com/cligs/conha19/blob/master/bib/bibliography.xml. Accessed April 8, 2020.
312. See chapter 3.1.2 above for details about how the authors were assigned to the countries.
313. The filetype is only categorized roughly into “image” or “text” instead of more detailed information such as “HTML”, “PDF”, etc. because the distinction between image- and text-based filetypes is the most relevant one affecting the way the texts had to be prepared to be included in the corpus (see the previous chapter 3.3.2 on text treatment). Filetypes such as HTML or PDF are not necessarily informative in this regard because both can contain the novel as text or images. See, for example, the short novel “La baronesa de Joux” by Gertrudis Gómez de Avellaneda, which is offered in an HTML format that consists of structural information with embedded images in the “Biblioteca Virtual Miguel de Cervantes” (Avellaneda [1871] 2008Avellaneda, Gertrudis Gómez de. (1871) 2008. La Baronesa de Joux: leyenda fundada en una tradición francesa (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmcng4q2.). As for the type of edition, “historical” refers to editions published in the period covered by the bibliography and the corpus (1830–1910), and “modern” refers to editions published after 1910. The list of institutions is, in principle, open and is only controlled to make sure that the institutional names are spelled consistently wherever they are used.
314. This metadata was already analyzed above in chapter 3.3.1 on the selection of novels and sources for the corpus.
315. In some definitions of the novel, an independent publication of the text is a necessary feature. See chapter 3.1.1.5 above.
316. It is assumed that it is less likely for novels that were not published independently to enter the literary canon.
317. See the corresponding novels “El guajiro” (1842, CU) by Cirilo Villaverde, “La cruz y la espada” (1866, MX), and “El filibustero” (1864, MX) by Eligio Ancona at https://github.com/cligs/conha19/blob/master/tei/nh0001.xml, https://github.com/cligs/conha19/blob/master/tei/nh0026.xml, and https://github.com/cligs/conha19/blob/master/tei/nh0180.xml respectively. Accessed March 24, 2020.
318. The ways in which literary prestige can be measured have been reflected upon in the context of a joint research project on “Computational Approaches to Complexity in Literary Texts” between the Universities of Osaka (led by Tomoji Tabata) and Würzburg (led by Fotis Jannidis), funded by the German Academic Exchange Service (DAAD) and the Japan Society for the Promotion of Science (JSPS) from 2017 to 2019. The way the prestige of the Spanish-American nineteenth-century novels is measured here results from the reflections in that project, in which the author of this dissertation participated. In detail, the following rules were set up for the search in the WorldCat: (1) all kinds of republications were counted, whether scholarly or general, printed or digital; (2) the only exception is digital editions of the IAI in Berlin dated to 2017, because these are scans of the novels that I commissioned myself and that therefore do not reflect the general prestige that the novels have gained; (3) the complete works of an author were disregarded, meaning that a novel that was only republished as part of complete works is still considered as low prestige. The assumption behind this decision is that complete works show an interest in an author and in his or her work as a whole but do not necessarily imply that all the individual works are valued highly; (4) for sequels, it was considered enough to find a reprint of (the title of) one (often the first) part because works originally published in several parts are often published together in later editions; (5) for works that were originally published dependently, it was also checked whether they were republished that way (for example, as part of a collection of selected works). The search in the WorldCat was performed on June 4, 2020.
319. See also chapter 4.1.3.2, where an overview of the novels in the corpus is given.
320. The taxonomy file is available at https://github.com/cligs/conha19/blob/master/schema/keywords.xml. Accessed March 24, 2020.
321. For more information about Schematron, see chapter 3.2.2 above, where the data model of the digital bibliography is explained. The Schematron file checking the corpus metadata can be viewed at https://github.com/cligs/conha19/blob/master/schema/keywords.sch. Accessed March 24, 2020.
322. The general corpus schema is commented on further below after the discussion of the TEI encoding of the textual body. There it is also explained how the schema files are processed to check the whole corpus and how errors are reported.
323. Editions are considered “historical” here if they were published within the chronological scope of the bibliography and corpus (1830–1910).
324. See the next chapter 3.3.4 on the assignment of subgenre labels to the novels in the corpus for details.
325. For a characterization of the paragraph from a linguistic point of view, see Rinas (2015Rinas, Karsten. 2015. “Zum linguistischen Status des Absatzes.” Aussiger Beiträge: germanistische Schriftenreihe aus Forschung und Lehre 9: 139–157.).
326. The anonymous block element is described as “any arbitrary component-level unit of text, acting as an anonymous container for phrase or inter level elements analogous to, but without the semantic baggage of, a paragraph” (Text Encoding Initiative Consortium 2023dText Encoding Initiative Consortium. 2023d. “<ab>.” In: TEI P5: Guidelines for Electronic Text Encoding and Interchange, 868–870. Version 4.6.0. Revision f18deffba. https://web.archive.org/web/20230423102337/https://tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.).
327. See also chapter 3.3.2 on text treatment above on this topic.
328. In the following metadata file, it is listed in which novels direct speech and thought were encoded: https://github.com/cligs/data-nh/blob/master/corpus/metadata_direct-speech.csv. The metadata file was produced with the script https://github.com/cligs/scripts-nh/blob/master/corpus/metadata_encoding/direct-speech-metadata.xsl. Accessed April 20, 2020.
329. For such machine learning approaches, see, for instance, Brunner (2013Brunner, Annelen. 2013. “Automatic recognition of speech, thought, and writing representation in German narrative texts.” Literary and Linguistic Computing. https://doi.org/10.1093/llc/fqt024., 2015Brunner, Annelen. 2015. Automatische Erkennung von Redewiedergabe: ein Beitrag zur quantitativen Narratologie. Narratologia: contributions to narrative theory. Vol. 47. Berlin, Boston: De Gruyter.), Byszuk et al. (2020Byszuk, Joanna, Michal Woźniak, Mike Kestemont, Albert Leśniak, Wojciech Łukasik, Artjoms Šeļa, and Maciej Eder. 2020. “Detecting Direct Speech in Multilingual Collection of 19th-century Novels.” In Proceedings of LT4HALA 2020 – 1st Workshop on Language Technologies for Historical and Ancient Languages, 100–104. Marseille: European Language Resources Association (ELRA). https://web.archive.org/web/20230611135104/https://aclanthology.org/2020.lt4hala-1.15.pdf), Jannidis et al. (2018Jannidis, Fotis, Leonard Konle, Albin Zehe, Andreas Hotho, and Markus Krug. 2018. “Analysing Direct Speech in German Novels.” In DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts, Köln, 26.2.-2.3.2018, edited by Georg Vogeler, 114–118. Köln: Universität zu Köln. https://doi.org/10.5281/zenodo.4622454.), and Schöch, Schlör, et al. (2016Schöch, Christof, Daniel Schlör, Stefanie Popp, Annelen Brunner, Ulrike Henny, and José Calvo Tello. 2016. “Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-Century French Novels.” In Digital Humanities 2016: Conference Abstracts, 346–353. Kraków: Jagiellonian University and Pedagogical University. https://web.archive.org/web/20230325081511/https://dh2016.adho.org/abstracts/31.). An alternative to a machine learning approach would be to apply the simple regular expression approach to the other two-thirds of the novels without manual correction and to accept the resulting error rate.
330. This element is only used in the files in which direct speech is also annotated with the element <said>, to be able to evaluate how many tokens contained in quotation marks are mistaken as direct speech.
331. In linguistics, several functional text types have been distinguished, for example, descriptive, narrative, expository, argumentative, and instructive text (Werlich 1975, 30–34Werlich, Egon. 1975. Typologie der Texte. Entwurf eines textlinguistischen Modells zur Grundlegung einer Textgrammatik. Heidelberg: Quelle & Meyer.).
332. The script is available at https://github.com/cligs/scripts-nh/blob/master/corpus/metadata_encoding/copy-all-but-said.xsl. Accessed April 16, 2020.
333. The pages were obtained from the 1894 edition of the novel available at Sicardi ([1894] 2016Sicardi, Francisco. (1894) 2016. “File:Libro extraño I — Francisco A. Sicardi.pdf.” Wikimedia Commons. http://web.archive.org/web/20230620100511/https://commons.m.wikimedia.org/wiki/File:Libro_extra%C3%B1o_I_-_Francisco_A._Sicardi.pdf.).
334. The script with which the tokenized TEI files and direct speech stand-off mark-up were produced is available at: https://github.com/cligs/scripts-nh/blob/master/corpus/metadata_encoding/evaluation-direct-speech.xsl. As a basis for this script, on the one hand, the TEI master files of the corpus that contain checked direct speech mark-up were used (as available at https://github.com/cligs/conha19/tree/master/tei). On the other hand, versions of the same TEI files without direct speech markup (available at https://github.com/cligs/conha19/tree/master/tei_ns) were treated with the pure regular expression approach producing a second version with direct speech markup (available at https://github.com/cligs/conha19/tree/master/tei_ds). These two versions of TEI files were then evaluated with the mentioned script. All links were accessed on August 16, 2020.
335. These calculations, as well as figures 30 to 32, were produced with the same script (see footnote 334). An overview of the scores in CSV format is available at https://github.com/cligs/data-nh/blob/master/corpus/metadata-encoding/direct-speech-evaluation-F1.csv. Accessed August 16, 2020.
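The evaluation described in footnotes 334 and 335 compares, token by token, a manually checked markup with the output of the regular expression approach. The following Python lines are a minimal sketch of how precision, recall, and F1 can be derived from such a comparison. It is not the project's XSLT script; the function name and the toy labels are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Token-level scores for two equally long lists of booleans
    (True = the token is marked as direct speech)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly marked tokens
    fp = sum(1 for g, p in pairs if p and not g)    # wrongly marked tokens
    fn = sum(1 for g, p in pairs if g and not p)    # missed tokens
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (invented, not corpus data).
gold = [True, True, True, False, False, True]
predicted = [True, True, False, False, True, True]
print(precision_recall_f1(gold, predicted))
```

Computing the scores per file, as the CSV overview suggests, would mean calling the function once for each pair of token lists.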
336. An F1 score of 0.939 has been reported for the recognition of direct speech in nineteenth-century French novels (Schöch, Schlör, et al. 2016Schöch, Christof, Daniel Schlör, Stefanie Popp, Annelen Brunner, Ulrike Henny, and José Calvo Tello. 2016. “Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-Century French Novels.” In Digital Humanities 2016: Conference Abstracts, 346–353. Kraków: Jagiellonian University and Pedagogical University. https://web.archive.org/web/20230325081511/https://dh2016.adho.org/abstracts/31.), and an accuracy of 0.9 for German novels (Jannidis et al. 2018Jannidis, Fotis, Leonard Konle, Albin Zehe, Andreas Hotho, and Markus Krug. 2018. “Analysing Direct Speech in German Novels.” In DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts, Köln, 26.2.-2.3.2018, edited by Georg Vogeler, 114–118. Köln: Universität zu Köln. https://doi.org/10.5281/zenodo.4622454.). Brunner achieved an F1 score of 0.87 for the recognition of direct speech, thought, and writing representation in German narrative texts (Brunner 2013Brunner, Annelen. 2013. “Automatic recognition of speech, thought, and writing representation in German narrative texts.” Literary and Linguistic Computing. https://doi.org/10.1093/llc/fqt024.). In their approach to a multilingual collection of nineteenth-century novels, Byszuk et al. report F1 scores ranging between 0.65 and 0.98 for the different languages when comparing the results of a regular expression approach with manually annotated samples. In their multilingual deep learning-based approach, they achieve a general F1 score of 0.873 (Byszuk et al. 2020Byszuk, Joanna, Michal Woźniak, Mike Kestemont, Albert Leśniak, Wojciech Łukasik, Artjoms Šeļa, and Maciej Eder. 2020. “Detecting Direct Speech in Multilingual Collection of 19th-century Novels.” In Proceedings of LT4HALA 2020 – 1st Workshop on Language Technologies for Historical and Ancient Languages, 100–104. Marseille: European Language Resources Association (ELRA). https://web.archive.org/web/20230611135104/https://aclanthology.org/2020.lt4hala-1.15.pdf).
337. In the metadata of the corpus, four kinds of source editions are differentiated: “first”, (other) “historical”, “modern”, and “unknown”. Here, the two categories, “first” and “historical”, are joined.
338. These are “Lucía Miranda” (1860, AR) by Eduarda Mansilla de García, “Pot-pourri” (1882, AR) by Eugenio Cambaceres, and “El fatalista” (1866, CU) by Esteban Pichardo y Tapia. In the corpus, the text of the first novel is based on an edition published in Buenos Aires in 1882, the second one on an edition published in Argentina in 1984, and the third one on the first edition published in Havana. In all three, double angular quotation marks are used instead of the usual single hyphen.
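The two typographic conventions mentioned in footnote 338 call for different patterns in a rule-based recognition of direct speech. The following Python lines are a rough sketch of such a heuristic, not the regular expressions actually used for the corpus. The assumption that speech and narrator insertions alternate at every dash after an opening dash, as well as all names, are made up for illustration.

```python
import re

DASH = "\u2014"  # long dash that opens direct speech in most of the novels
GUILLEMETS = re.compile(r"«([^»]*)»")  # double angular quotation marks


def direct_speech_spans(paragraph):
    """Return the stretches of a paragraph that the heuristic takes as speech."""
    text = paragraph.strip()
    if text.startswith(DASH):
        # After the opening dash, every further dash is assumed to switch
        # between a character's speech and an insertion by the narrator.
        segments = re.split(DASH, text)[1:]
        spans = [segment.strip(" .,;:") for segment in segments[0::2]]
        return [span for span in spans if span]
    return [match.strip() for match in GUILLEMETS.findall(text)]


print(direct_speech_spans(DASH + "Hola " + DASH + "dijo Juan" + DASH + ". ¿Cómo estás?"))
print(direct_speech_spans("Dijo: «Vamos a casa» y salió."))
```

A heuristic of this kind cannot tell quoted speech from other uses of the same marks, which is one reason why the results were checked manually for part of the corpus and evaluated against that checked markup.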
339. It would not be possible to use a simple division element (<div>) for that purpose because, following the rules of the TEI, any divisional structure opened inside another one has to be continued. That is, it would cause an error to continue with paragraphs of the running chapter after the insertion of a division for an embedded text.
340. The TEI standard offers many different modules for encoding all types of texts. Normally, projects dealing with a specific type of text define a subset of the TEI’s modules to work with. Such a subset can be documented in a so-called ODD (“One document does it all”) file, a format that is used to express TEI customizations and that is independent of other formal schema languages. Schemas in different languages can be generated from the ODD in a second step. RELAX NG is one schema language that is suitable for XML (Murata 2014Murata, Makoto. 2014. “RELAX NG home page.” https://web.archive.org/web/20230604105524/https://relaxng.org/.; Text Encoding Initiative Consortium n.d.bText Encoding Initiative Consortium. n.d.b “Getting Started with P5 ODDs.” https://web.archive.org/web/20230423104437/https://tei-c.org/favicon.ico.).
341. See https://github.com/cligs/conha19/blob/master/schema/keywords.sch on GitHub. Accessed August 19, 2020.
342. See https://github.com/cligs/reference. Accessed August 20, 2020.
343. See the examples in chapter 3.2.3 above.
344. The output of the FreeLing tagger is integrated into the TEI structure of the main corpus files, and the results are stored as a derivative TEI format, as explained further in chapter 3.3.5 below.
345. The Python file for validation is accessible at https://github.com/cligs/scripts-nh/blob/master/corpus/metadata_encoding/validate_tei.py. The resulting log file can be viewed at https://github.com/cligs/conha19/blob/master/schema/log-rng.txt. Accessed August 20, 2020.
346. Here, the compilation of the Schematron file as XSLT was done with the software Oxygen by choosing the preconfigured Transformation Scenario “ISO Schematron to XSLT (compile)”. The resulting XSLT file can be viewed at https://github.com/cligs/conha19/blob/master/schema/keywords-compiled.xsl. The subsequent command line call for Saxon is “java -jar /home/ulrike/Programme/saxon/saxon9he.jar -s:/home/ulrike/Git/conha19/tei/ -o:/home/ulrike/Git/conha19/tei-checked/ -xsl:/home/ulrike/Git/conha19/schema/keywords-compiled.xsl > /home/ulrike/Git/conha19/schema/log-schematron.txt”. The resulting Schematron log file is accessible at https://github.com/cligs/conha19/blob/master/schema/log-schematron.txt. If the file is empty, no errors were found. Accessed August 20, 2020.
347. Generic signals can occur in the entire text of literary works, but the opening is the most prominent place for them: “The generic markers that cluster at the beginning of a work have a strategic role in guiding the reader. They help to establish, as soon as possible, an appropriate mental ‘set’ that allows the work’s generic codes to be read. One might call them the key words of the code, although they may serve this purpose at an unconscious level, or at least beneath the level of attention” (Fowler 1982, 88Fowler, Alastair. 1982. Kinds of Literature. An Introduction to the Theory of Genres and Modes. Oxford: Clarendon Press.).
348. See chapter 3.3.3.1.6 for a general explanation of the encoding of metadata in the keyword section.
349. See the etymological information on “rastaquouère” in the lexical portal of the Centre National de Ressources Textuelles et Lexicales: “personne méprisable [...] tanneur, grossiste en peaux, en cuirs [...] Le sens péj. du fr. est prob. dû au fait que beaucoup de Sud-américains à l'élégance tapageuse qui séjournaient à Paris à la fin du xixes. devaient leur fortune récente au commerce des cuirs et peaux” (Centre National de Ressources Textuelles et Lexicales (CNRTL) 2012Centre National de Ressources Textuelles et Lexicales (CNRTL). 2012. “RASTAQUOUÈRE.” Portail lexical. https://web.archive.org/web/20230611105723/https://www.cnrtl.fr/etymologie/rastaquou%C3%A8re.).
350. See also chapter 3.3.3.1 on the TEI encoding of front matters.
351. See chapter 3.2.2 on the question of generic identity and entities of intellectual creations.
352. A “china” is a person with indigenous traits or of a different race; a “lépero” is an indecent, ordinary person; “polla” is probably a colloquial designation for a young woman; a “chinaco” is a pejorative name for a liberal guerrilla fighter; a “tendero” is the owner of a grocery store. See the definitions of these terms in the Spanish Royal Academy’s “Diccionario de la lengua española” (Real Academia Española (RAE) 2023aReal Academia Española (RAE). 2023a. “Diccionario de la lengua española. (DLe).” Accessed June 11, 2023. https://dle.rae.es/.).
353. See chapter 3.3.3.1 above for details about the novels’ copyright status.
354. The XSLT file is available at https://github.com/cligs/scripts-nh/blob/master/corpus/derivative_formats/get-plaintext.xsl. Accessed August 23, 2020.
355. The client/server mode is advantageous if many small files are processed. It was used here because the novels were treated per paragraph, resulting in 531,006 plain text snippets to be analyzed for the whole corpus.
356. In the corpus, paragraph-like structures include paragraphs, verse lines, and headings other than part and chapter headings.
357. WordNet lexnames are lexicographer files into which the synsets are organized; they consist of syntactic categories (nouns, verbs, adjectives, adverbs) and logical groupings (e.g., nouns denoting animals versus body parts). They add more lexical information to the synsets (Princeton University 2023Princeton University. 2023. “lexnames(5WN).” WordNet. A Lexical Database for English. https://web.archive.org/web/20230610175939/https://wordnet.princeton.edu/documentation/lexnames5wn.).
358. The workflow for the linguistic annotation was designed as part of the work in the CLiGS project. Various members of the group were involved in the programming of its different parts, as indicated in the Python files. The workflow consists of three Python files: “workflow_teifw.py” (for settings and to start the process) and the two modules “prepare_tei.py” (for pre-processing and post-processing of the FreeLing annotation) and “annotate_fw.py” (for the FreeLing annotation itself and WordNet calls). The scripts are available at https://github.com/cligs/scripts-nh/tree/master/corpus/derivative_formats. Previous versions can be viewed at the CLiGS group’s toolbox repository: https://github.com/cligs/toolbox/tree/master/annotate. Accessed August 24, 2020.
359. See chapter 3.3.2 above for details.
360. The list of regular expressions that was used is available at https://github.com/cligs/data-nh/blob/main/corpus/derivative-formats/verb-form-patterns-es-detail.txt. A list of exception words was prepared to cover cases where words that are not verb forms with enclitic pronouns are matched by the regular expressions. The exception list can be viewed at https://github.com/cligs/data-nh/blob/main/corpus/derivative-formats/verbs-enclitics-exceptions.txt. The exception list is not exhaustive but covers the most frequent cases. The Python script used to evaluate the corpus with regard to verb forms with enclitic pronouns is published at https://github.com/cligs/scripts-nh/blob/master/corpus/derivative_formats/verbs_enclitics.py. The resulting counts are accessible at https://github.com/cligs/data-nh/blob/main/corpus/derivative-formats/verbs_enclitics_in_files.csv. Accessed November 20, 2020.
361. A summary of the counts is available at https://github.com/cligs/data-nh/blob/main/corpus/derivative-formats/verbs-enclitics-freeling.csv. Accessed November 20, 2020.
362. For unambiguous cases, the accents were also corrected with the help of substitution rules. For example, “dábanle” is transformed to “daban” and “le”, and “hízose” is transformed to “hizo” and “se” (the accent is not needed anymore and would be incorrect on the freestanding verb form). The list of accent replacements in verb form endings is available at https://github.com/cligs/data-nh/blob/main/corpus/derivative-formats/verb-form-endings-accents.txt. Accessed November 20, 2020.
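The combination of patterns, an exception list, and accent substitutions described in footnotes 360 and 362 can be illustrated with a small Python sketch. The two ending rules below are only derived from the examples “dábanle” and “hízose”; the pronoun list, the rule table, and the function name are hypothetical stand-ins for the project's actual lists, which are linked in the notes.

```python
# Hypothetical, deliberately tiny rule tables (longer pronouns are tried first).
ENCLITICS = ("les", "los", "las", "nos", "le", "lo", "la", "me", "te", "se")
ACCENTED_ENDINGS = {"ában": "aban", "ízo": "izo"}  # accented ending -> plain ending


def split_enclitic(token, exceptions=frozenset()):
    """Split a verb form from its enclitic pronoun and drop the accent.

    Tokens on the exception list, or tokens that match no ending rule,
    are returned unchanged as a one-element list."""
    if token.lower() in exceptions:
        return [token]
    for pronoun in ENCLITICS:
        if not token.endswith(pronoun):
            continue
        verb = token[: -len(pronoun)]
        for accented, plain in ACCENTED_ENDINGS.items():
            if verb.endswith(accented):
                return [verb[: -len(accented)] + plain, pronoun]
    return [token]


print(split_enclitic("dábanle"))
print(split_enclitic("hízose"))
print(split_enclitic("casa"))
```

Restricting the split to forms whose ending is covered by a rule keeps the sketch conservative: a word that merely happens to end in a pronoun-like string is left untouched.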
363. See footnote 360 above.
364. The decision to rely on the two infrastructures of GitHub and Zenodo.org for publishing the corpus is the result of the work in the junior research group CLiGS (Schöch et al. 2019, paras 36–38Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp. 2019. “The CLiGS Textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in TEI XML.” Journal of the Text Encoding Initiative. Rolling Issue. https://doi.org/10.4000/jtei.2085.).
365. The metadata mentioned in the table was generated with the script “metadata.xsl” available at https://github.com/cligs/scripts-nh/blob/master/corpus/metadata.xsl. Accessed September 24, 2020.
366. In digital literary stylistics, the separation of different stylistic signals that correspond to literary categories (authorship, genre, epoch, etc.) on the textual level has repeatedly been a concern (see, for instance, Burrows 2002Burrows, John. 2002. “‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship.” Literary and Linguistic Computing 17 (3): 267–287. https://doi.org/10.1093/llc/17.3.267.; Schöch 2013Schöch, Christof. 2013. “Fine-tuning Stylometric Tools: Investigating Authorship and Genre in French Classical Theater.” In Digital Humanities 2013. Conference Abstracts, Lincoln, NE, USA, July 16–19, 383–386. Lincoln, NE, USA: University of Nebraska-Lincoln. https://web.archive.org/web/20230304104934/http://dh2013.unl.edu/abstracts/ab-270.html.). For an attempt to “neutralize” the authorial signal to be able to analyze genre style, see Calvo Tello et al. (2017Calvo Tello, José, Daniel Schlör, Ulrike Henny, and Christof Schöch. 2017. “Neutralizing the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels.” In Digital Humanities 2017. Conference Abstracts, Montréal, Canada, August 8–11, 2017, 181–184. Montreal: McGill University & Université de Montréal. https://web.archive.org/web/20230212053238/https://dh2017.adho.org/abstracts/037/037.pdf.). On the other hand, differences between linguistic and literary styles are usually not emphasized in linguistic textbooks on the topic, where a general coverage of stylistic phenomena is pursued. This is also due to revised definitions of style. As Sowinski notes: “In älteren Arbeiten zur Stilistik wird Stil ausschließlich literarischen Werken zugesprochen [...]. Erst in neuerer Zeit ist diese Einschränkung gefallen: Stil wird heute allen Texten zugesprochen, wenn auch in unterschiedlicher Art und Weise. [...] Auch die Gebrauchstexte pragmatischer Natur müssen nicht arm an Stilelementen sein, wenn hier auch oft andere Stilzüge (z.B. 
Ökonomie, Präzision) den Verzicht auf bestimmte Stilelemente, z.B. des affektiven oder bildhaften Bereichs bedingen können” (Sowinski 1999, 73Sowinski, Bernhard. 1999. Stilistik: Stiltheorien und Stilanalysen. Stuttgart: Metzler.). However, Sowinski does not discuss how the different elements of style relate to authorship or period. In their “revisited” definition of style, the digital humanists Herrmann, Schöch, and van Dalen-Oskam also strive to offer a generally applicable notion: “In our approach, style is not something unique to literary works; rather, every text has a certain kind of style.” They also have a general view on the issue of the relationship between style and other categories, but with a focus on literary ones: “In our definition, style can be associated with categories such as genre, epoch, author, and many more. In many cases, correlations between specific style markers or groups of style markers with these categories may be observed. What is more, even in the absence of conscious intentions, causal relationships may be hypothesized: genre can cause style (e.g., by means of conventions: form and themes), authors can cause style (e.g., by means of idiosyncrasies), theme and topic can cause style. The interpretability of style relative to categories such as authorship, literary genre, or literary period, is hence paramount. This means that any stylistic phenomenon can ultimately be considered the trace of or the index towards such categories [...]” (Herrmann, Schöch, and van Dalen-Oskam 2015, 46Herrmann, J. Berenike, Christof Schöch, and Karina van Dalen-Oskam. 2015. “Revisiting Style, a Key Concept in Literary Studies.” Journal of Literary Theory 9 (1): 25–52. https://doi.org/10.1515/jlt-2015-0003.). Nevertheless, in empirical digital literary studies on genre, authorship is usually the factor that interferes most. 
The importance of considering literary periods for corpus building in literary studies is stressed by Gemeinböck (2016, 37Gemeinböck, Iris. 2016. “Representativeness in corpora of literary texts: introducing the C18P project.” MATLIT: Materialidades da Literatura 4 (2): 29–48. https://doi.org/10.14195/2182-8830_4-2_2.). Therefore, the position taken here is that although general principles for corpus building and the assessment of representativeness are the same for linguistic and literary corpora (such as determining a population and a sampling frame and following certain sampling strategies), there are differences in the kind and in the relevance of categories to assess internal variability.
367. For instance, in Biber’s text on representativeness itself, studies of distributions of linguistic features are presented (Biber 1993a, 248–255Biber, Douglas. 1993a. “Representativeness in Corpus Design.” Literary and Linguistic Computing 8 (4): 243–257. https://web.archive.org/web/20230128095417/http://otipl.philol.msu.ru/media/biber930.pdf.).
368. Some of the first studies in this direction are, for example, Jannidis et al. (2018Jannidis, Fotis, Leonard Konle, Albin Zehe, Andreas Hotho, and Markus Krug. 2018. “Analysing Direct Speech in German Novels.” In DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts, Köln, 26.2.-2.3.2018, edited by Georg Vogeler, 114–118. Köln: Universität zu Köln. https://doi.org/10.5281/zenodo.4622454.), Schöch, Schlör, et al. (2016Schöch, Christof, Daniel Schlör, Stefanie Popp, Annelen Brunner, Ulrike Henny, and José Calvo Tello. 2016. “Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-Century French Novels.” In Digital Humanities 2016: Conference Abstracts, 346–353. Kraków: Jagiellonian University and Pedagogical University. https://web.archive.org/web/20230325081511/https://dh2016.adho.org/abstracts/31.), and Jockers (2013, 118–153Jockers, Matthew L. 2013. Macroanalysis. Digital Methods & Literary History. Topics in the Digital Humanities. Urbana, Chicago, and Springfield: University of Illinois Press.). The first and second are concerned with the detection and distribution of direct speech in German and French nineteenth-century novels, respectively, and the last with the distribution of topics in nineteenth-century English language novels.
369. The script “overview.xsl” is available at https://github.com/cligs/scripts-nh/blob/master/corpus/overview.xsl. The various resulting files can be viewed at https://github.com/cligs/data-nh/tree/master/corpus/overview. Accessed September 1, 2020.
370. Here and in the following, the percentages are rounded numbers.
371. The number of works per author is shown in more detail in figure 94 in the appendix of figures.
372. Here, the top 21 were chosen because they include all the authors with 6 or more novels from the bibliography and with 4 or more novels in the corpus without having to make a cut inside a group of authors with the same number of novels. Authors with the same number of novels are ordered alphabetically by surname, so the top positions can only be compared roughly. Listing all authors who share a position together would have resulted in too many names at the top, especially for the corpus.
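The selection rule described in footnote 372 (keep every author above a minimum number of novels so that no group of tied authors is cut apart, then order by count and surname) can be sketched in Python as follows. The function name and the sample counts are invented for illustration and do not reproduce the data of the bibliography or the corpus.

```python
def top_without_splitting_ties(counts, min_count):
    """counts maps a surname to a number of novels.

    Every author with at least min_count novels is kept, so tied authors
    always stay together; the result is ordered by descending count and,
    within a tie, alphabetically by surname."""
    kept = [(name, n) for name, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Invented sample data.
sample = {"Delta": 6, "Alpha": 6, "Beta": 9, "Gamma": 5}
print(top_without_splitting_ties(sample, min_count=6))
```

Plain string comparison is used for the alphabetical order here; surnames with accented characters would need locale-aware sorting.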
373. I.e., inside the chronological limits of Bib-ACMé: 1830–1910.
374. In figure 95 in the appendix of figures, the number of editions per author is shown for both Bib-ACMé and Conha19. For Conha19, only editions of the works that are included in the corpus are counted.
375. As in the previous table for the authors with the most works, the top positions here are ordered first by the number of editions and then alphabetically, so the positions cannot be interpreted as a strict sequence. Here, 20 positions were chosen because they include all the authors with 8 or more novels from the corpus. For the bibliography, a cut-off had to be made inside the group of authors with 10 editions, leaving only Juana Manuela Gorriti in the table.
376. In figure 96 in the appendix of figures, the proportions of authors by country are displayed for both the bibliography and the corpus.
377. See the corresponding figures 97 to 99 in the appendix of figures.
378. The author gender proportions in Bib-ACMé and Conha19 are visualized in figure 100 in the appendix of figures.
379. See figure 101 in the appendix of figures.
380. In figure 102 in the appendix of figures, the numbers of births and deaths of the authors are visualized by decade.
381. See figure 103 in the appendix of figures. The figure displays the sums of authors who were not yet born, alive, and dead over the years. The death curves level off towards the end, probably a sign that the authors got older over time.
382. The sums of active authors per year are shown in figure 104 in the appendix of figures.
383. See figure 105 in the appendix of figures.
384. See figure 106 in the appendix of figures.
385. See figures 107 to 109 in the appendix of figures.
386. In figure 108 in the appendix of figures, the percentages indicate the proportion of works contained in the corpus for each decade.
387. See figure 109 in the appendix of figures.
388. See figure 110 in the appendix of figures.
389. See figure 111 in the appendix of figures.
390. See figure 112 in the appendix of figures.
391. See chapter 3.3.3.1.6 (“Text Classification with Keywords”) above for details on how prestige was modeled.
392. See figure 113 in the appendix of figures for an overview of the proportions of high- and low-prestige novels by country.
393. See figures 114 (by decade) and 115 (before versus in or after 1880) in the appendix of figures.
394. The mean proportion of novels in the corpus compared to the bibliography is 31 %, and the 1900s are only represented with 20 %. See the overviews in the previous chapter.
395. Comparing the corpus to the bibliography, in the 1840s, 54 % of the works are covered.
396. See figure 116 in the appendix of figures.
397. See figures 117 and 118 in the appendix of figures.
398. See figure 119 in the appendix of figures. The novel with another setting is the science fiction novel “Viaje maravilloso del Señor Nic-Nac” (1875, AR) by Eduardo Holmberg, which tells an imagined trip of the protagonist to the planet Mars.
399. See figure 120 in the appendix of figures.
400. See figures 121 and 122 in the appendix of figures.
401. See chapter 3.3.3.1.6 (“Text Classification with Keywords”) above for details. The proportions of works set in the different time periods are visualized in figure 123 in the appendix of figures.
402. See figure 124 in the appendix of figures.
403. See figure 111 on the works by country in the appendix of figures, which was discussed in the previous chapter.
404. See figures 125 and 126 in the appendix of figures.
405. The occupation with events that belong to the distant or recent past or are contemporary is not necessarily to be equated with the novels being historical novels or not. As Read states in his study of the Mexican historical novel: “It will be apparent as this study progresses that the works involved fall naturally into two groups, the romantic historical novels that deal with the conquest period and colonial times, and the novels that deal with historical events of the nineteenth century. The first group is essentially romantic, corresponding to the type developed by Walter Scott but with a distinctly local ‘middle ages’. Instead of turning to medieval Europe for exotic material, Mexican writers of this type of fiction sought out characters and institutions of their own dim past. Their hostility to the Spanish regime was still fresh enough to inspire them with a feeling of spiritual kinship with the Amerinds who had been the traditional enemies of the Europeans. [...] In this same group of romantic historical novels belong those fictional works that deal with colonial times. The Inquisition [...] is naturally the center of interest of these poetic interpretations of life in the colony. The second class of Mexican historical novels is that which finds its material in the history of the nineteenth century, the epoch in which the writers themselves had been actors in the dramas they presented. Such works may properly be called novels of contemporary history. Many of them were patterned after the Episodios nacionales of Pérez Galdós and the various historical romances of Erckmann-Chatrian. But though these two groups of works deal with materials from widely separated periods, they have much in common. Whatever the period involved, it was interpreted in terms of the ideals of the nineteenth century when Mexico was attempting to constitute itself a new nation [...]. 
Patriotism, a new sense of national identity and zeal for liberty and justice were the emotive forces that determined the trend of interpretation in both groups of historical novels to which attention has been called” (Read 1939, ix–xiRead, John Lloyd. 1939. The Mexican Historical Novel. 1826–1910. New York: Instituto de las Españas en los Estados Unidos.). Both types of novels also existed in Argentina: historical novels in the strict sense, which broached the issues of the Conquest and colonial times, and novels that treated contemporary historical events or those of the very recent past. Many of the early novels of the latter kind had the Rosas regime as their subject. Molina subsumes them under the group of political novels and calls them “novelas prospectivamente históricas” (Molina 2011, 246–249, 285–311Molina, Sintia. 2001. El Naturalismo en la novela cubana. Lanham, Maryland: University Press of America.). A contemporary setting was also predominant in the realist and naturalistic novels of the later nineteenth century: “With reference to Argentina, it is significant that the development of the realistic novel should coincide with the extraordinary growth and progress which the Republic manifested during the years 1880 to 1900. As this economic and material transformation took place, greatly affecting every facet of the nation’s life [...] eager writers sought to mirror that rapid change and portray the new society that was surging forth” (Lichtblau 1959, 138Lichtblau, Myron I. 1959. The Argentine Novel in the Nineteenth Century. New York: Hispanic Institute in the United States.). Regarding the Argentine naturalistic novel, Lichtblau remarks: “Not only did Argentina produce the first naturalistic novelist in Hispanic America in the figure of Cambaceres, but that country displayed as well the greatest over-all development of the naturalistic current in the nineteenth century. 
The tremendous material advancement, the great influx of immigrants, the changing social pattern, and the growing industrialization of the Republic–all these things writers used to advantage in applying the tenets of Zola to Argentine fiction.” (176–177Lichtblau, Myron I. 1959. The Argentine Novel in the Nineteenth Century. New York: Hispanic Institute in the United States.).
406. This is summarized in the box plot in figure 127 in the appendix of figures.
407. The numbers are rounded to the next thousand. The shortest novel is “Gubi Amaya” (1865, AR) by Juana Manuela Gorriti, and the longest one is the historical novel “El mendigo de San Ángel” (1865, MX) by Niceto de Zamacois. Surprisingly, the shortest and the longest novel of the corpus were first published in the same year. Despite the lower limit being 16,000 tokens, two novels with approximately 15,800 tokens, among them “Gubi Amaya”, were included because the number of tokens changed in the course of the preparation of the corpus, amongst other things, due to the correction of spelling errors. So in a strict sense, these two novels only fulfill the minimum length criterion when the number of tokens is rounded to the next thousand.
408. See figure 128 in the appendix of figures. Again, in the following, the numbers are rounded to the next thousand.
409. The script used for the significance tests is available at https://github.com/cligs/scripts-nh/blob/master/analysis/sign.py. Accessed January 3, 2021. Because the data is not normally distributed, the Mann-Whitney U test was used (instead of a t-test, for instance). The p-value that was calculated for the text lengths of Mexican versus Argentine novels is 0.001, for Mexican versus Cuban novels 0.04, and for Argentine versus Cuban novels 0.2, which means that there is no significant difference in the latter case.
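The test named in this note can be illustrated with a minimal sketch. It is not the script referenced above: it is a self-contained, pure-Python version of the two-sided Mann-Whitney U test that uses the normal approximation with tie correction and no continuity correction, so its p-values can differ slightly from those of a library routine such as `scipy.stats.mannwhitneyu`. The function name and the sample values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns the U statistic of sample x and an approximate p-value.
    Assumes both samples are non-empty and not all values are identical.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples and remember the group of each value (0 = x, 1 = y).
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)

    # Assign average ranks to tied values and collect the tie correction term.
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u_x, p_value


# Hypothetical text lengths (in thousands of tokens) for two groups of novels.
group_a = [120, 95, 140, 88, 160]
group_b = [60, 72, 55, 81, 90]
u, p = mann_whitney_u(group_a, group_b)
print(f"U = {u}, p = {p:.3f}")
```

With a significance level of 0.05, as used in this note, a p-value below 0.05 is read as a significant difference in text length between the two groups.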
410. See figure 129 in the appendix of figures.
411. See figure 110 (“Works by decade and country”) in the appendix of figures.
412. The script used for the calculation of significances and variance ratios is available at https://github.com/cligs/scripts-nh/blob/master/analysis/sign.py. Accessed January 3, 2021. The data is not normally distributed, so the Mann-Whitney U test was used. The following p-values resulted: 0.02 for 1860s versus 1870s, 0.003 for 1860s versus 1880s, and 0.045 for 1880s versus 1900s. The other constellations had p-values above 0.05. The 1830s and 1910s were not included in the calculations because there are only 2 and 3 works for these decades, respectively.
413. Calculating the ratio of text length variances for different pairs of decades shows that the differences in variance are biggest for the 1860s versus 1890s, the 1860s versus 1880s, and the 1850s versus 1890s. The difference in variance can be considered significant for the 1850s versus 1880s-1900s, the 1840s versus 1880s-1900s, and for all the constellations of the 1860s versus later decades. The differences in text length variance between the 1870s and the 1880s as well as the 1870s and 1890s are also significant, but the remaining ones are not. Variance ratios between 0.5 and 2.0 are considered similar, and values below 0.5 or above 2.0 as significantly different. The three biggest ratios of variance are 5.4 for the 1860s versus 1890s, 4.8 for the 1860s versus 1880s, and 4.0 for the 1850s versus 1890s. The 1830s and 1910s were not included in the calculations because of the low number of works in these two decades.
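The rule of thumb stated in this note (ratios between 0.5 and 2.0 count as similar) can be sketched as follows. This is an illustration only, not the referenced script: the function names and sample values are hypothetical, and the use of the sample variance (rather than the population variance) is an assumption.

```python
from statistics import variance


def variance_ratio(a, b):
    """Ratio of the sample variances of two groups of text lengths."""
    return variance(a) / variance(b)


def variances_differ(a, b, lower=0.5, upper=2.0):
    """Apply the rule of thumb: ratios outside [lower, upper] count as different."""
    ratio = variance_ratio(a, b)
    return ratio < lower or ratio > upper


# Hypothetical text lengths (in thousands of tokens) for two decades.
decade_a = [40, 180, 65, 250, 90]
decade_b = [70, 85, 95, 110, 80]
print(round(variance_ratio(decade_a, decade_b), 1), variances_differ(decade_a, decade_b))
```

Because the ratio depends on which group is the numerator, a ratio of 0.4 and a ratio of 2.5 describe the same degree of difference seen from opposite directions.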
414. See the corresponding figure 130 in the appendix of figures.
415. See figures 131 to 133 in the appendix of figures.
416. See figure 134 for the proportions of editions by country and figure 135 for the number of editions published in different cities in Bib-ACMé and Conha19. Both figures can be found in the appendix of figures. In the figure on cities, only those that appear at least twice in the bibliography are included.
417. In a future analysis, it could be interesting to examine whether the presence or absence of the label “novela” corresponds to different subtypes of the genre.
418. See the corresponding figure 136, in which a series of donut charts is given, in the appendix of figures.
419. The analysis is based on whether the works first published in the respective decades ever carried the label “novela” between 1830 and 1910. The works are dated according to their first known edition, but their subgenre labels are collected for all the editions that were published in the chronological frame of this study. This introduces a certain fuzziness concerning the anchoring of subgenre labels in time, so the effect of change might even be stronger.
420. See chapter 3.2.3 above for details.
421. See figure 137 in the appendix of figures. The top 20 positions were calculated from the bibliography’s point of view.
422. See figure 138 in the appendix of figures.
423. See figure 139 in the appendix of figures.
424. That more implicit subgenre signals were found for the novels in the corpus also results from the fact that more paratextual information was evaluated for them (title pages, prefaces, etc.).
425. See figures 140 and 141 in the appendix of figures.
426. See figures 142 and 143, which display the top 20 subgenre labels assigned to the works by literary historians.
427. See figures 144 and 145 in the appendix of figures. In figure 144, the number of different labels related to the realization of the discursive act is smaller than the sum of labels related to its three subgroups (“semantic”, “syntactic”, and “medium”) because the same label can be associated with several levels. For example, the labels “novela filosófica” and “novela psicológica” are categorized both as realization/semantic/theme and as realization/syntactic/mode.representation, because the terms point both to certain themes (e.g., general considerations about the meaning of life in a philosophical novel or the focus on personal, emotional states of characters in a psychological novel) and also to a certain way of representation (e.g., an argumentative style in philosophical novels and an introspective narrative style in psychological novels).
428. The visualization for the corpus is available at https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-labels-number-corpus.html. Accessed September 15, 2020.
429. See figure 145 in the appendix of figures. When the overall number of labels was determined for the various categories, identical labels stemming from different kinds of sources were only counted once for each novel (e.g., if a novel was explicitly labeled as “novela histórica” and also classified as such by literary historians).
430. Again, on each level, if the same label is assigned to a work by different sources, it is only counted once.
431. The visualization for the overall number of labels in the different categories in the corpus is available at https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-labels-amount-corpus.html. Accessed September 15, 2020.
432. To determine the overall number of labels, they were counted on each discursive level, so a homonymic label on different levels is counted several times. On the other hand, if several literary-historical sources mentioned the same label for a work, it was only counted once per level.
433. See the corresponding visualizations “subgenres-labels-number-explicit-bib.html”, “subgenres-labels-amount-explicit-bib.html”, “subgenres-labels-number-litHist-bib.html”, “subgenres-labels-amount-litHist-bib.html” at https://github.com/cligs/data-nh/tree/master/corpus/overview. Accessed September 15, 2020.
434. For the corpus, see the charts “subgenres-labels-number-explicit-corp.html”, “subgenres-labels-amount-explicit-corp.html”, “subgenres-labels-number-litHist-corp.html”, “subgenres-labels-amount-litHist-corp.html” at https://github.com/cligs/data-nh/tree/master/corpus/overview. Accessed September 15, 2020.
435. See, for instance, examples of statements on the relationship to the European literatures and questions of emancipation in Rössner’s literary history of Latin America. About nineteenth-century Caribbean literature, he writes: “Das Streben nach Unabhängigkeit ist nicht nur eine politische Angelegenheit, es bestimmt auch das literarische Leben. Einerseits orientieren sich die karibischen Literaten des 19. Jhs. an europäischen, besonders spanischen und französischen Vorbildern [...], andererseits bemühen sie sich darum, diese Modelle nicht einfach zu kopieren, sondern sich deren theoretisches Gedankengut mit Berücksichtigung all der Spezifika ihres amerikanischen Lebensraums kreativ anzuverwandeln” (Rössner 2007, 153Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.). In this context, there are also discussions of individual authors and works: “José López Portillo y Rojas hingegen distanziert sich von französischen Vorbildern. In dem Vorwort zu seinem Roman La parcela (1898) weist er die ästhetische Wortkunst Flauberts oder der Brüder Goncourt ebenso zurück, wie er auch die Obszönität Zolas meiden möchte. Stattdessen bezieht er sich auf die Spanier Galdós und vor allem Pereda, deren Vorliebe für das naturverbundene Leben in der Provinz auch in Portillos Bauernroman ihren thematischen Niederschlag findet. Die Reserve dem französischen Kulturerbe gegenüber hat sich auch in dem Roman Fuertes y débiles (1919) erhalten. Das hier porträtiert porfiristische Gesellschaftssystem krankt daran, dass sich der französische Positivismus nicht einfach auf die mexikanischen Verhältnisse übertragen lässt und an den espíritus débiles der científicos scheitert” (Rössner 2007, 148Rössner, Michael. 2007. Lateinamerikanische Literaturgeschichte. 3rd ed. Stuttgart, Weimar: J.B. Metzler.).
436. See figure 146 in the appendix of figures, which shows how many works have these 18 labels in Bib-ACMé and how many works there are in Conha19 with the same label.
437. See figure 147 in the appendix of figures. The different source types are stacked for each subgenre label to highlight their proportions, but the sums can be bigger than the ones of the number of works carrying the label because the same label can have various types of sources. This applies to all the following charts of subgenre label sources.
438. See figure 148 in the appendix of figures.
439. Such connections between different labels on the same level are listed in table 10 above (not exhaustively, but for some obvious cases).
440. Combinations of labels that are at the same time part of combinations of more labels are counted each time (the combination of “novela de costumbres” and “novela social”, for example, is also counted for works that have a combination of “novela de costumbres”, “novela gauchesca”, and “novela social”). Combinations with the same number of assignments are ordered alphabetically. The whole list of combinations is available at https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-label-combinations-theme.csv. Accessed September 29, 2020.
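The counting rule described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the referenced CSV file: it assumes that every combination of two or more labels contained in a work's label set is counted once per work, including the full set itself, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_label_combinations(works):
    """Count all label combinations (size >= 2) contained in each work's label set.

    `works` maps a work identifier to an iterable of subgenre labels.
    Returns (combination, count) pairs sorted by descending count,
    with ties ordered alphabetically.
    """
    counts = Counter()
    for labels in works.values():
        unique_labels = sorted(set(labels))
        for size in range(2, len(unique_labels) + 1):
            for combination in combinations(unique_labels, size):
                counts[combination] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical works and thematic labels.
works = {
    "work_1": ["novela de costumbres", "novela social"],
    "work_2": ["novela de costumbres", "novela gauchesca", "novela social"],
}
for combination, count in count_label_combinations(works):
    print(count, " + ".join(combination))
```

In this example, the pair “novela de costumbres” and “novela social” is counted twice, because it also occurs inside the three-label combination of the second work.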
441. See, for instance, Alpaydin (2016, 37–41Alpaydin, Ethem. 2016. Machine Learning: The New AI. Cambridge, Mass.: The MIT Press.) on the issue of generalization in supervised learning approaches. See also Branco, Torgo, and Ribeiro (2015Branco, Paula, Luís Torgo, and Rita P. Ribeiro. 2015. “A Survey of Predictive Modelling under Imbalanced Distributions.” ACM Computing Surveys 49 (2): 1–50. https://doi.org/10.1145/2907070.) on the challenges of imbalanced data distributions.
442. See figure 149 in the appendix of figures.
443. For overviews of multi-label learning methods, see Elkafrawy, Mausad, and Esmail (2015Elkafrawy, Passent, Amr Mausad, and Heba Esmail. 2015. “Experimental Comparison of Methods for Multi-Label Classification in Different Application Domains.” International Journal of Computer Applications 114 (19): 1–9. http://web.archive.org/web/20230513205320/https://research.ijcaonline.org/volume114/number19/pxc3901666.pdf.) and Madjarov et al. (2012Madjarov, Gjorji, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski. 2012. “An extensive experimental comparison of methods for multi-label learning.” Pattern Recognition 45 (9): 3084–3104. https://doi.org/10.1016/j.patcog.2012.03.004.).
444. This combination is found in the subtitles of the novels “Calvario y Tabor. Novela histórica y de costumbres” (1868, MX) by Vicente Riva Palacio, “Julia. Novela histórica y de costumbres” (1868, MX) by Manuel Martínez de Castro, and “Astucia. Novela histórica de costumbres mexicanas” (1865–1866, MX) by Luis Gonzaga Inclán.
445. See figure 150 in the appendix of figures.
446. See figure 151 in the appendix of figures.
447. These are “Cecilia Valdés” (1882, CU) by Cirilo Villaverde, “Abismos” (1890, AR) by Manuel Bahamonde, “Amalia” (1891, MX) by José Rafael Guadalajara, “Los bandidos de Río Frío” (1892, MX) by Manuel Payno, “Angelina” (1893, MX), “Historia vulgar” (1904, MX), “La Calandria” (1890, MX), and “Los parientes ricos” (1901, MX) by Rafael Delgado. The novel “Abismos” is not part of the corpus.
448. These are again “Abismos” (1890, AR) by Manuel Bahamonde, “La Calandria” (1890, MX) by Rafael Delgado and “Los bandidos de Río Frío” (1892, MX) by Manuel Payno.
449. Writing about novels of customs, for example, Fernández-Arias Campoamor (1952, 56Fernández-Arias Campoamor, José. 1952. Novelistas de Mejico: esquema de la historia de la novela mejicana (de Lizardi al 1950). Madrid: Ediciones Cultura Hispánica.) states: “Los novelistas románticos que fueron costumbristas constituyen el puente tendido entre el romanticismo y el realismo [...] el costumbrismo como inclinación extensa y generalizada se inicia en el romanticismo”. However, as he subsumes the section “Costumbristas” under “El Romanticismo”, the novels mentioned in that chapter are interpreted here as having been labeled as romantic novels by him.
450. This was the case, for example, with the works of Rafael Delgado, which have been described as both romantic and realist (Gálvez 1990, 105–106Gálvez, Marina. 1990. La novela hispanoamericana (hasta 1940). Madrid: Taurus.; Varela Jácome [1982] 2000, sec. 2.1.3Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.).
451. See figure 152 in the appendix of figures.
452. There are even earlier cases of naturalistic novels: “La familia Unzúazu” (1868, CU) is the earliest novel that has been classified as “novela naturalista”.
453. Varela Jácome, for instance, comments on the delayed absorption of the romantic current in Spanish America and the chronological overlap of different literary currents: “La novela romántica no se aclimata en Hispanoamérica hasta el año 1846. Esto significa un claro asincronismo, con respecto a la narrativa de Europa y Estados Unidos, debido a la conflictividad ideológica y la carencia de modelos culturales idóneos. [...] Al margen de los obstáculos históricos, se produce, con una secuenciación discontinua, la introducción de modelos narrativos foráneos. [...] La novela indianista, iniciada muy temprano, en 1832, con Netzula, de Lafragua, se desarrolla en estratificación con los otros metagéneros románticos; en su época culminante, con Cumanda (1871), de Mera, se superpone sobre la narrativa de tendencia realista, a las últimas manifestaciones del ciclo coinciden, incluso, con la incorporación de las técnicas naturalistas” (Varela Jácome [1982] 2000, sec. 1.1.3Varela Jácome, Benito. (1982) 2000. Evolución de la novela hispanoamericana en el siglo XIX (en formato HTML). Alicante: Biblioteca Virtual Miguel de Cervantes. https://www.cervantesvirtual.com/nd/ark:/59851/bmct14z8.).
454. Namely “Camila o la tiranía de Juan Manuel de Rosas” (1836, AR) by Agustín Fontanella, “Sepulcros blanqueados” (1836, MX) by Juan Antonio Mateos, and “Lina Montalván, o el terremoto que destruyó el Callao y la ciudad de Lima en 1746” (1905, AR) by José Victoriano Cabral.
455. The earliest novel classified as novela realista is “El negro Francisco” (1839, CU) by Antonio Zambrana y Vázquez. Even if this is an outlier, the second and third quartiles of realist novels also lie within the scope of the romantic novels, as the plot in the appendix of figures shows.
456. See figure 153 in the appendix of figures.
457. See figure 154 in the appendix of figures.
458. Combinations of labels that are at the same time part of combinations of more labels are counted each time (the combination “novela-memorias”, for example, is also counted for works that have “novela-episodios-memorias”). The whole list of combinations is available at https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-label-combinations-mode-representation.csv. Accessed September 29, 2020.
459. See figure 155 in the appendix of figures.
460. See figure 156 in the appendix of figures.
461. See figure 157 in the appendix of figures.
462. In the case of the “novela mexicana”, it cannot be decided whether the label refers to the country or the capital.
463. According to the “Diccionario de la lengua española” of the Spanish Royal Academy, “suriana” means “coming from the south of Mexico” (Real Academia Española (RAE) 2023bReal Academia Española (RAE). 2023b. “suriano, na.” Diccionario de la lengua española (DLe). http://web.archive.org/web/20230601151138/https://dle.rae.es/suriano.).
464. “Coming from Guadalajara” (Real Academia Española (RAE) 2023cReal Academia Española (RAE). 2023c. “tapatío, a.” Diccionario de la lengua española (DLe). http://web.archive.org/web/20230601151450/https://dle.rae.es/tapat%C3%ADo.).
465. See figure 158 in the appendix of figures.
466. See the file https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-label-combinations-identity.csv, which lists the combinations of identity labels in Bib-ACMé and Conha19. Accessed September 29, 2020. Combinations of labels that are at the same time part of combinations of more labels are counted each time (the combination “novela cubana-novela original”, for example, is also counted for the work that has the combination “novela cubana-novela original-novela regional”).
467. The labels “novela kantabro-americana” and “novela franco-argentina” are assigned to the group of the American context.
468. The label “novela franco-argentina” is assigned to the group of Argentine novels.
469. See figure 159 in the appendix of figures.
470. See figure 160 in the appendix of figures.
471. See figure 161 in the appendix of figures.
472. The number of works associated with them, as well as the kind of sources for the labels that refer to the attitude of the author or narrator, are visualized in figures 162 and 163 in the appendix of figures.
473. The number of works per label, as well as the sources of the labels, are given in figures 164 and 165 in the appendix of figures.
474. This observation is supported by figure 166 in the appendix of figures, which summarizes how many labels are assigned to how many works.
475. These are: “novela”, “novela romántica”, “novela sentimental”, “novela histórica”, “novela social”, “novela de costumbres”, “novela realista”, “novela original”, “novela naturalista”, “novela mexicana”, “memorias”, “episodios”, and “novela política”. Complete lists of the different subgenre terms in Bib-ACMé and Conha19 and the number of works to which they are assigned are available at https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-works-per-label-bib.csv and https://github.com/cligs/data-nh/blob/master/corpus/overview/subgenres-works-per-label-corp.csv. Accessed October 1, 2020.
476. See figure 167 in the appendix of figures for a visualization of these proportions.
477. See figure 168 in the appendix of figures.
478. See figures 169 and 170 in the appendix of figures.
479. See figures 171 for the bibliography and 172 for the corpus in the appendix of figures. The percentages in the two figures indicate the proportion of novels of a certain subgenre in comparison to all the novels in the respective period.
480. See figure 173 in the appendix of figures, in which the proportions of primary thematic subgenres are displayed for high- and low-prestige novels.
481. See figure 174 in the appendix of figures.
482. See figure 175 in the appendix of figures.
483. See figure 176 in the appendix of figures. For the figure, the time period was evaluated in relationship to the year of the first known publication of the novels. For details on how the time period was determined, see chapter 3.3.3.1.6 (“Text Classification with Keywords”) above.
484. See figure 177 in the appendix of figures.
485. The numbers are rounded to the next thousand.
486. The script that was used for the significance tests and the calculation of variances is available at https://github.com/cligs/scripts-nh/blob/master/analysis/sign.py. Accessed January 3, 2021. As the data is not normally distributed, the Mann-Whitney U test was used to check for the statistical significance of the different text lengths. The p-value for novela histórica versus novela sentimental is 2.8e-06, for novela histórica versus novela de costumbres 0.002, for novela histórica versus novela social 0.0005, and for novela histórica versus novela política 0.02. For the novela sentimental versus novela social, the p-value is at the limit of significance with 0.047, but for all the other combinations of subgenres, the p-value is higher than 0.05.
487. The following ratios of variance were calculated: 5.9 for novela histórica versus novela política, 5.2 for novela histórica versus novela social, 4.1 for novela histórica versus novela sentimental, and 0.4 for novela sentimental versus novela de costumbres.
488. The general proportions of primary literary currents in Conha19 and Bib-ACMé are visualized in figure 178 in the appendix of figures.
489. The known instances of romantic, realist, naturalistic, and modernist novels could be used to determine the literary current of the 55 novels for which this information is still lacking by means of text classification.
490. See figure 179 in the appendix of figures.
491. See figures 180 and 181 in the appendix of figures.
492. See figures 182 and 183 in the appendix of figures.
493. See figure 184 in the appendix of figures for a visualization of these proportions.
494. See chapter 3.3.3.1.6, where the collection of prestige metadata for the novels in the corpus is outlined.
495. See figure 185 in the appendix of figures.
496. In the corpus, the realist novels which have a first-person narrator are the four novels of the Mexican writer Emilio Rabasa: “La bola” (1887), “La gran ciencia” (1887), “El cuarto poder” (1888), and “Moneda falsa” (1888), furthermore the sentimental realist novel “Angelina” (1893, MX) by Rafael Delgado, the historical realist novel “Las ranas pidiendo rey. Confesiones de una afrancesada (1861–1862)” (1903, MX) by Victoriano Salado Álvarez, and the novels “La gran aldea” (1884, AR) by Lucio Vicente López, “Mi tío el empleado” (1887, CU) by Ramón Meza, “Don Perfecto” (1902, AR) by Carlos María Ocantos, and “Divertidas aventuras del nieto de Juan Moreira” (1910, AR) by Roberto Payró.
497. See figure 186 in the appendix of figures.
498. The only novel that is neither set in America nor Europe is a science fiction novel set on the planet Mars.
499. See figure 187 in the appendix of figures.
500. See figure 188 in the appendix of figures.
501. The numbers are rounded to the next thousand.
502. The Python script that was used for the significance tests is available at https://github.com/cligs/scripts-nh/blob/master/analysis/sign.py. Accessed January 3, 2021. The data is not normally distributed, so the Mann-Whitney U test was used. The p-value for novela romántica versus novela realista is 0.22, for novela romántica versus novela naturalista 0.16, and for novela realista versus novela naturalista 0.40.
503. The ratio of the variance between romantic and realist novels is 3.7, between romantic and naturalistic novels 2.3, and between realist and naturalistic novels 0.6.
504. If larger numbers of the most frequent items are chosen, the general features also include tokens that are content words, which have more specific semantic properties than function words. This is an effect of the size of the feature space rather than the result of preliminary considerations regarding the semantics of the features.
505. See the presentation of the different subgenres in chapter 2.3.
506. An overview of different disciplinary approaches to the analysis of themes and topics is given in Anz (2007Anz, Thomas. 2007. “Inhaltsanalyse.” In Handbuch Literaturwissenschaft. Gegenstände – Konzepte – Institutionen, edited by Thomas Anz, vol. 2, Methoden und Theorien, 55–69. Stuttgart, Weimar: J.B. Metzler.).
507. See chapter 2.2 for the working definitions of literary style used in this study.
508. See, for example, first approaches to model literary space conducted by Barth and Viehhauser (2017Barth, Florian, and Gabriel Viehhauser. 2017. “Digitale Modellierung literarischen Raums.” In DHd2017. Konferenzabstracts. https://doi.org/10.5281/zenodo.4622732.).
509. See chapter 3.3.2 on text treatment.
510. The sentence is taken from the novel “Lucía Miranda” (1860, AR) by Eduarda Mansilla de García.
511. Underscores are used to represent blank spaces to enhance the readability of the n-grams.
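The notation mentioned in this note can be illustrated with a small sketch of general character n-grams in which blank spaces are shown as underscores. The function name and the sample phrase are hypothetical, and collapsing runs of whitespace into a single underscore is an assumption; the sketch does not reproduce the typed n-gram subtypes discussed in the following notes.

```python
def char_ngrams(text, n):
    """Return all character n-grams of `text`, with blank spaces shown as underscores."""
    # Collapse any run of whitespace into a single underscore (an assumption).
    normalized = "_".join(text.split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


print(char_ngrams("la casa", 3))
```

For the phrase “la casa”, the 3-grams that span the word boundary appear as “la_”, “a_c”, and “_ca”, which makes the position of the blank space visible.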
512. For details on how these are implemented, see Sapkota et al. (2015, 94–95Sapkota, Upendra, Steven Bethard, Manuel Montes-y-Gómez, and Thamar Solorio. 2015. “Not All Character N-grams Are Created Equal: A Study in Authorship Attribution.” In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 93–102. Denver, Colorado: Association for Computational Linguistics. http://dx.doi.org/10.3115/v1/N15-1010.).
513. By “single-domain”, they mean that the corpus consists of texts about a single topic written by different authors. “Cross-domain” means a corpus with multiple different topics.
514. “Whole-word” was not selected because the general bag-of-words approach already covers this feature type. Of the affix-like features, “space-prefix” and “space-suffix” were not selected because they are congruent with “prefix” and “suffix”, the only difference being that they are one character shorter because they start or end with a blank space. “Suffix” was not selected because, in the Spanish language, the suffixes are highly influenced by how pronouns are used (if they are used freely before or attached to verb forms after them). As has been shown in the chapter on text treatment above (see 3.3.2), the difference in pronoun use is connected to the type of edition and its publication year, so it cannot be reliably associated with the subgenres. Of the three punctuation-based n-gram types, only “end-punct” was used because it is the only one capturing phrase-level structures of the sentences.
515. The source of this and the other subtype definitions in the table is Sapkota et al. (2015, 94–95).
516. See the overviews on text length in chapter 4.1.3.2.
517. The 1 that is added to the idf ensures that terms that occur in all documents will not be ignored (Scikit-learn developers 2007–2023gScikit-learn developers. 2007–2023g. “sklearn.feature_extraction.text.TfidfTransformer.” Scikit-learn. https://web.archive.org/web/20230304130653/https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html.).
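The effect of the added 1 can be made concrete with a small sketch. It assumes the unsmoothed formulation idf(t) = ln(n / df(t)) + 1; scikit-learn additionally offers a smoothed variant, which is left out here. The helper `idf` is hypothetical and written only for illustration.

```python
import math


def idf(n_docs, doc_freq, add_one=True):
    """Inverse document frequency of a term occurring in `doc_freq` of `n_docs` documents."""
    raw = math.log(n_docs / doc_freq)  # equals 0 for a term found in every document
    return raw + 1.0 if add_one else raw


# A term that occurs in all four documents of a toy collection:
print(idf(4, 4, add_one=False))  # weight vanishes without the added 1
print(idf(4, 4))                 # weight is kept with the added 1
```

Without the added 1, any tf value multiplied by this idf would be zero, so the term would drop out of the feature set entirely.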
518. The common abbreviation “MFW” is used in the general sense of “most frequent items” here and also includes the most frequent n-grams. The features were generated with a Python script available at https://github.com/cligs/scripts-nh/blob/master/features/general_features.py. The resulting feature sets are available at https://github.com/cligs/data-nh/tree/master/analysis/features/mfw. Both links were accessed on November 4, 2020.
519. See chapter 3.3.5 for details about the linguistic annotation. The annotated TEI files were used in their corrected form and are available at https://github.com/cligs/conha19/tree/master/annotated_corr and the corresponding full-text files at https://github.com/cligs/conha19/tree/master/txt_annotated_corr. Accessed December 14, 2020.
520. See chapter 3.3.2 about the text treatment. The resulting stop word list is available at https://github.com/cligs/data-nh/blob/master/analysis/features/stopwords/mfw_stopwords.txt. Accessed November 4, 2020.
521. By default, the CountVectorizer uses the token pattern (?u)\b\w\w+\b, which matches sequences of at least two word characters delimited by word boundaries. For the Spanish texts in the corpus, the pattern was slightly modified to also cover words consisting of just one character, such as “y” or “a”: (?u)\b\w+\b.
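The difference between the two token patterns in footnote 521 can be checked with Python's standard `re` module, on the assumption that the vectorizer applies the pattern in the manner of `re.findall`. The sample sentence and the helper `tokenize` are invented for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # tokens of two or more word characters
MODIFIED_PATTERN = r"(?u)\b\w+\b"    # also admits one-character tokens


def tokenize(text, pattern):
    """Lowercase the text and return all tokens matching the pattern."""
    return re.findall(pattern, text.lower())


sentence = "Lucía y su hermano fueron a la ciudad"  # invented example sentence
print(tokenize(sentence, DEFAULT_PATTERN))   # "y" and "a" are dropped
print(tokenize(sentence, MODIFIED_PATTERN))  # "y" and "a" are kept
```

Since one-character words such as “y” and “a” are very frequent in Spanish, the default pattern would silently remove them from the most frequent words.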
522. The special subtypes of character n-gram units were created with a Python approach designed specifically for this purpose (see the link to the script in footnote 518) because the CountVectorizer only supports the general character n-grams that include all types of characters.
523. The script https://github.com/cligs/scripts-nh/blob/master/features/general_features.py was also used for this purpose. The resulting plots are available at https://github.com/cligs/data-nh/tree/master/analysis/features/mfw/overviews. Accessed November 4, 2020.
524. Random forest classifiers, for example, tend not to work very well with high-dimensional sparse data. Support Vector Machines (SVM), on the other hand, usually handle these well but work better when features have similar scales (Müller and Guido 2016, 90, 103–106Müller, Andreas C., and Sarah Guido. 2016. Introduction to Machine Learning with Python: a Guide for Data Scientists. Sebastopol, CA: O’Reilly.).
525. This chart only shows the feature sets based on single words. Charts displaying the proportions of zero values in the feature sets based on other token units (word n-grams and character n-grams) are available on GitHub. See footnote 523 above.
526. For details, see the overviews at https://github.com/cligs/data-nh/tree/master/analysis/features/mfw/overviews. Accessed November 4, 2020.
527. The last digit of each value was rounded.
528. See, for instance, the remarks made on the scikit-learn website on how different feature variances influence the selection of coefficients in linear models (Scikit-learn developers 2007–2023cScikit-learn developers. 2007–2023c. “Common pitfalls in interpretation of coefficients of linear models.” Scikit-learn. https://web.archive.org/web/20230304130148/https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html.).
529. The method was initially developed in the context of Information Retrieval as a general approach to model text corpora: “The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgements” (Blei, Ng, and Jordan 2003, 993Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent dirichlet allocation.” Journal of Machine Learning Research 3: 993–1022. https://web.archive.org/web/20230310095853/https://dl.acm.org/doi/pdf/10.5555/944919.944937.).
530. The cited version of the paper is a reprint of Firth 1957. The wider context of this quote is: “The placing of a text as a constituent in a context of situation contributes to the statement of meaning since situations are set up to recognize use. As Wittgenstein says, ‘the meaning of words lies in their use.’ [...] The day-to-day practice of playing language games recognizes customs and rules. It follows that a text in such established usage may contain sentences such as ‘Don’t be such an ass!’, ‘You silly ass!’, ‘What an ass he is!’ In these examples, the word ass is in familiar and habitual company, commonly collocated with you silly—, he is a silly—, don’t be such an—. You shall know a word by the company it keeps! One of the meanings of ass is its habitual collocation with such other words as those above quoted” (Firth 1968, 179Firth, John Rupert. 1968. “A synopsis of linguistic theory 1930–1955.” In Selected Papers of J. R. Firth 1952–1959, edited by F. R. Palmer, 168–205. Bloomington, London: Indiana University Press.).
531. For a general discussion of the foundations of the distributional hypothesis, see Sahlgren 2008Sahlgren, Magnus. 2008. “The Distributional Hypothesis.” Rivista di Linguistica 20 (1): 33–53. https://web.archive.org/web/20230310101109/https://www.italian-journal-linguistics.com/app/uploads/2021/05/Sahlgren-1.pdf..
532. What exactly the “context” is and which scope it has depends on the implementation of the distributional analysis. The two main approaches are to either use surrounding words or to use text regions in which the words occur together as a basis. Current implementations of topic modeling, for instance, Latent Dirichlet Allocation (LDA), usually follow the latter strategy (Sahlgren 2008, 33Sahlgren, Magnus. 2008. “The Distributional Hypothesis.” Rivista di Linguistica 20 (1): 33–53. https://web.archive.org/web/20230310101109/https://www.italian-journal-linguistics.com/app/uploads/2021/05/Sahlgren-1.pdf.). Depending on the kind and size of the context, Sahlgren differentiates between two types of semantic similarity. According to him, a wider, document-oriented context leads to models that capture “semantic relatedness (e.g. ‘boat’ - ‘water’)”, while a narrower, word-oriented context models “semantic similarity (e.g. ‘boat’ - ‘ship’)” (Sahlgren 2015Sahlgren, Magnus. 2015. “A brief history of word embeddings (and some clarifications).” Linked in. https://web.archive.org/web/20230310102225/https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren.). As topic modeling is document-oriented, topics are characterized by semantic relatedness.
533. So far, the decisions for a certain length of the documentsIn-1 follow empirical experience or rules of thumb, because there are no theoretical foundations for this parameter of text preprocessing for topic modeling yet. In a survey on LDA-based topic modeling in Digital Humanities, Du notes: “The common preprocessing procedures include lemmatization, part-of-speech (POS) tagging and document chunking. [...] Chunking allows us to capture topics which only appear at certain points. My survey pays particular attention to the reasons of applying (or not) a preprocessing procedure in practice. [...] Document chunking is very diverse: the chunk-size could be several hundred or several thousand words, or a page of a book, or to split a book into ten equal segments. But almost no approach explained the reason of their chunking choices” (Du 2019Du, Keli. 2019. “A Survey on LDA Topic Modeling in Digital Humanities.” In Proceedings of DH2019: ‘Complexities’. Utrecht: Utrecht University. https://web.archive.org/web/20220121042220/https://dev.clariah.nl/files/dh2019/boa/0326.html.).
534. This is the so-called “term-document matrix”, “a data structure, a computationally tractable (to use McCarty’s term) representation of the texts able to be modeled by a computational process” (Burton 2013Burton, Matt. 2013. “The Joy of Topic Modeling.” May 21, 2013. https://web.archive.org/web/20211012091043/http://mcburton.net/blog/joy-of-tm/.).
535. Here, “vocabulary” means the set of different words contained in the corpus.
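The terms “term-document matrix” (footnote 534) and “vocabulary” (footnote 535) can be illustrated with a minimal sketch. The function `term_document_matrix`, the whitespace tokenization, and the two toy documents are assumptions made for illustration only.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return the sorted vocabulary and a matrix with one row per term, one column per document."""
    tokenized = [doc.lower().split() for doc in documents]  # naive whitespace tokenization
    vocabulary = sorted({token for doc in tokenized for token in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix


vocabulary, matrix = term_document_matrix(["mar barco mar", "barco puerto"])
for term, row in zip(vocabulary, matrix):
    print(term, row)
```

The vocabulary is the set of different words across all documents; each cell holds the count of one term in one document.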
536. The example topic model is available at https://github.com/hennyu/papers/tree/master/family_resemblance_dsrom19/features/topicmodel. Accessed November 12, 2020. It was created with the following parameters: 100 topics, 5,000 iterations, and a hyperparameter optimization interval of 100. The texts were preprocessed by lemmatizing them and only using nouns. The original input full-text files were chunked into segments with a length of 1,000 tokens. A list of stop words was applied, which contained the 50 most frequent nouns plus some nouns that were added manually. The structure of the table corresponds to the examples shown in Steyvers and Griffiths (2007, 2Steyvers, Mark, and Tom Griffiths. 2007. “Probabilistic Topic Models.” In Handbook of latent semantic analysis, edited by Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, 427–448. Mahwah, NJ: Lawrence Erlbaum Associates. https://web.archive.org/web/20220927113904/https://cocosci.princeton.edu/tom/papers/SteyversGriffiths.pdf.).
537. Word embeddings are another method from the area of distributional semantics. In word embeddings, words from a vocabulary are converted to vectors of real numbers, and semantic relationships can be inferred from the proximity or distance of the word vectors to each other (Sahlgren 2015Sahlgren, Magnus. 2015. “A brief history of word embeddings (and some clarifications).” Linked in. https://web.archive.org/web/20230310102225/https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren.).
538. Word nets are lexical databases of semantic relations between words, for example, synonymy or hyponymy (Miller 1995Miller, George A. 1995. “WordNet: A Lexical Database for English.” Communications of the ACM 38 (11): 39–41. https://doi.org/10.1145/219717.219748.).
539. This information about token counts per topic can be found in the diagnostics file of the tool MALLET, which was used to create the model (McCallum 2018aMcCallum, Andrew. 2018a. “Topic model diagnostics.” MALLET: A Machine Learning for Language Toolkit. https://web.archive.org/web/20200221035417/https://mallet.cs.umass.edu/diagnostics.php.).
540. See footnote 536 above for information about the underlying topic model. In the table, the topic probability values are rounded.
541. A forerunner of LDA is, for example, Latent Semantic Analysis (LSA), which is not probabilistic (Landauer and Dumais 1997Landauer, Thomas K., and Susan T. Dumais. 1997. “A solution to Plato’s problem: the Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge.” Psychological Review 104: 211–240. https://psycnet.apa.org/doi/10.1037/0033-295X.104.2.211.; Landauer, Foltz, and Laham 1998Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. 1998. “Introduction to latent semantic analysis.” Discourse Processes 25: 259–284. https://doi.org/10.1080/01638539809545028.). Newer approaches are Nonnegative Matrix Factorization (NMF) and a network-based approach using a stochastic block model (SBM) (Arora, Ge, and Moitra 2012Arora, Sanjeev, Rong Ge, and Ankur Moitra. 2012. “Learning Topic Models – Going beyond SVD.” In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 1–10. Washington D.C.: IEEE Computer Society. https://doi.ieeecomputersociety.org/10.1109/FOCS.2012.49.; Gerlach, Peixoto, and Altmann 2018Gerlach, Martin, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. “A network approach to topic models.” Science Advances 4 (7): 1–11. https://dx.doi.org/10.1126/sciadv.aaq1360.). In the context of deep learning, the method lda2vec has been developed, which combines the learning of word, document, and topic vectors (Moody 2016Moody, Christopher E. 2016. “Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec.” arXiv.org. https://doi.org/10.48550/arXiv.1605.02019.).
542. This is the Dirichlet distribution which gives the algorithm its name (Blei, Ng, and Jordan 2003, 996–997Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent dirichlet allocation.” Journal of Machine Learning Research 3: 993–1022. https://web.archive.org/web/20230310095853/https://dl.acm.org/doi/pdf/10.5555/944919.944937.).
543. This process is called Gibbs sampling (Steyvers and Griffiths 2007, 7–9Steyvers, Mark, and Tom Griffiths. 2007. “Probabilistic Topic Models.” In Handbook of latent semantic analysis, edited by Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, 427–448. Mahwah, NJ: Lawrence Erlbaum Associates. https://web.archive.org/web/20220927113904/https://cocosci.princeton.edu/tom/papers/SteyversGriffiths.pdf.).
544. The index TM is used here to differentiate the general term topic from the term as it is defined in the context of the topic modeling method.
545. For a useful discussion of the similarities and differences of topicsTM to literary theoretical terms on thematic aspects, see Horstmann (2018Horstmann, Jan. 2018. “Topic Modeling.” forTEXT. Literatur digital erforschen. https://web.archive.org/web/20230316111225/https://fortext.net/routinen/methoden/topic-modeling. ). Besides the term Thema, Horstmann also compares the German terms Motiv, Topos, and Sujet to topicsTM and comes to the conclusion that they all cover different thematic and content-related aspects of literary texts.
546. In German, the term Themenentfaltung is defined as “die gedankliche Ausführung des Themas” and, more specifically, “Die Entfaltung des Themas zum Gesamtinhalt des Textes kann als Verknüpfung bzw. Kombination relationaler, logisch-semantische definierter Kategorien beschrieben werden, welche die internen Beziehungen der in den einzelnen Textteilen (Überschrift, Abschnitten, Sätzen usw.) ausgedrückten Teilinhalte bzw. Teilthemen zum thematischen Kern des Textes (dem Textthema) angeben (z. B. Spezifizierung, Begründung usw.)” (Brinker, Cölfen, and Pappert 2014, 57Brinker, Klaus, Hermann Cölfen, and Steffen Pappert. 2014. Linguistische Textanalyse. Eine Einführung in Grundbegriffe und Methoden. 8th ed. Berlin: Erich Schmidt Verlag.). These concepts go back to Brinker (1992Brinker, Klaus. 1992. Textlinguistik. Heidelberg: Groos.).
547. For the abstract of the conference panel, see Willand et al. (2017Willand, Marcus, Peer Trilcke, Christof Schöch, Nanette Rißler-Pipka, Nils Reiter, and Frank Fischer. 2017. “Aktuelle Herausforderungen der Digitalen Dramenanalyse.” In DHd 2017. Digitale Nachhaltigkeit. Konferenzabstracts, 46–49. Bern: Universität Bern. https://doi.org/10.5281/zenodo.4622643.). The results of the contribution by Schöch and Rißler-Pipka about topic types can be found in the presentation, which can be downloaded at https://github.com/christofs/dramenanalyse-dhd/. Accessed November 14, 2020.
548. See, for example, the study on Argument Mining using topic modeling presented by Lawrence and Reed (2017Lawrence, John, and Chris Reed. 2017. “Mining Argumentative Structure from Natural Language text using Automatically Generated Premise-Conclusion Topic Models.” In Proceedings of the 4th Workshop on Argument Mining, 39–48. Copenhagen: Association for Computational Linguistics (ACL). http://dx.doi.org/10.18653/v1/W17-5105.).
549. The method has even been applied intentionally to discover non-thematic structures, for example, by Rhody: “By locating ‘figurative language’ as an aspect of address for topic modeling, I choose to constrain my consideration of poetic texts and agree to a caricature of poetry that hyper-focuses on its figurative aspects so that we can better understand how topic modeling, a methodology that deals with language at the level of word and document, can be leveraged to identify latent patterns in poetic discourse” (Rhody 2012Rhody, Lisa M. 2012. “Topic Modeling and Figurative Language.” Journal of Digital Humanities 2 (1). https://web.archive.org/web/20230316135657/https://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/.).
550. For instance, only noun lemmas are used in Hettinger et al. (2016 Hettinger, Lena, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2016. “Classification of Literary Subgenres.” In DHd2016. Modellierung – Vernetzung – Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Universität Leipzig 7. bis 12. März 2016, 160–164. Duisburg: nisaba verlag. https://doi.org/10.5281/zenodo.4645368.) and Schöch, Henny et al. (2016Schöch, Christof, Ulrike Henny, José Calvo Tello, Daniel Schlör, and Stefanie Popp. 2016. “Topic, Genre, Text. Topics im Textverlauf von Untergattungen des spanischen und hispanoamerikanischen Romans (1880–1930).” In DHd 2016. Modellierung, Vernetzung, Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts, 235–239. Leipzig: Universität Leipzig. https://doi.org/10.5281/zenodo.4645380.). Noun, verb, adjective, and adverb lemmas are used by Schöch (2017cSchöch, Christof. 2017c. “Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama.” Digital Humanities Quarterly 11 (2). https://web.archive.org/web/20230211105751/http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html.).
551. MALLET was used in version 2.0.8RC3.
552. He recommends setting the optimization interval depending on the goal of the analysis: “If your goal is to identify small numbers of texts about specific themes in a large collection, then a lot of optimization may be good. However, if your goal is to identify topics typical of certain authors, periods, genres or some other reasonably large subset of your collection, then it may be better to optimize a bit less” (Schöch 2016Schöch, Christof. 2016. “Topic Modeling with MALLET: Hyperparameter Optimization.” The Dragonfly’s Gaze. https://web.archive.org/web/20230316145457/https://dragonfly.hypotheses.org/1051.).
553. The script for creating the noun lemma files is available at https://github.com/cligs/scripts-nh/blob/master/corpus/derivative_formats/get-plaintext-annotated-nouns.xsl. The resulting plain-text files are available at https://github.com/cligs/conha19/tree/master/txt_annotated_nouns. Accessed November 15, 2020.
554. The proper names and place names in the stopword list can cover cases that were missed by the named entity recognition. The stopword list for the topic models is available at https://github.com/cligs/data-nh/blob/master/analysis/features/stopwords/topics_stopwords.txt. Accessed November 14, 2020. The list was refined by inspecting selected topic models and adding words with very general meanings that dominated some topics and reduced their interpretability.
555. The tool was presented at a workshop at the DH2017 conference in Montréal by the CLiGS team (Betz et al. 2017Betz, Katrin, Ulrike Henny-Krahmer, Christian Pölitz, Christof Schöch, and Albin Zehe. 2017. “The ‘topic modeling workflow (tmw)’.” Talk presented at the workshop ‘Let’s Develop an Infrastructure for Historical Research Tools’ at the conference DH2017, Montreal, Canada. Accessed November 14, 2020. https://christofs.github.io/tmw-dh/#/.). A more recent presentation about topic modeling that includes a description of tmw is Schöch (2019Schöch, Christof. 2019. “Distributional Semantics and Topic Modeling: Theory and Application.” Workshop given at the Baltic Summer School of Digital Humanities: Essentials of Coding and Encoding, Riga, July 2019. Accessed November 14, 2020. https://christofs.github.io/riga/#/.).
556. The version of tmw which was used here is available at https://github.com/cligs/scripts-nh/tree/master/features/tmw. Accessed November 14, 2020.
557. Tmw is even able to apply a smooth segmenting that respects paragraph boundaries, but this feature was not used here. Instead, a fixed segment length was defined.
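Segmenting with a fixed length, as mentioned in footnote 557, can be sketched as follows. This is not the tmw implementation; the function `chunk_tokens` is hypothetical, and keeping a shorter final segment is an assumption, since the source does not say how the remainder of a text is handled.

```python
def chunk_tokens(tokens, size=1000):
    """Split a token list into consecutive segments of `size` tokens each."""
    # Assumption: the last segment is kept even if it is shorter than `size`.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


tokens = ["w%d" % i for i in range(2500)]  # placeholder tokens standing in for a novel
segments = chunk_tokens(tokens, size=1000)
print([len(segment) for segment in segments])
```

A segmenting that respects paragraph boundaries would instead close a segment at the nearest paragraph break, which yields segments of varying length.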
558. The script that applies the different parameter settings, controls the topic modeling workflow, and calls the functions of tmw is available at https://github.com/cligs/scripts-nh/blob/master/features/topics.py. The resulting topic models are available at https://github.com/cligs/data-nh/tree/master/analysis/features/topics/. Accessed November 14, 2020.
559. The charts summarizing this finding can be viewed at https://github.com/cligs/data-nh/tree/main/analysis/features/topics/overviews. Accessed December 22, 2020.
560. See chapter 3.2.3 on the assignment of subgenre labels to the novels in the bibliography and chapter 3.3.4 for the novels in the corpus.
561. As explained below in chapter 4.2.2.2, the categories in the family resemblance network are built through community detection. A discussion of the challenges to evaluating network communities, including a proposition to use ground truth, can be found in Yang and Leskovec (2015Yang, Jaewon, and Jure Leskovec. 2015. “Defining and Evaluating Network Communities based on ground-truth.” Knowledge and Information Systems 42: 181–213. https://doi.org/10.1007/s10115-013-0693-z.).
562. This description of machine learning is based on Alpaydin (2016, 1–3Alpaydin, Ethem. 2016. Machine Learning: The New AI. Cambridge, Mass.: The MIT Press.).
563. For a formal definition of the generalization problem, see Alpaydin (2016, 23–27, 37–41Alpaydin, Ethem. 2016. Machine Learning: The New AI. Cambridge, Mass.: The MIT Press.).
564. See chapter 4.2.1 on the chosen feature sets for details.
565. It would also be possible to use more specific types of features, for example, temporal expressions (including dates or words related to duration or repetitive events). By using such special features, the textual material is reduced much more, making it more difficult for classifiers to optimize a model for the differences between subgenres. As a consequence, the hypotheses about the relevance of these features for genre classification would have to be much stronger, for example, based on assumptions about how temporal expressions are used in historical novels versus other subgenres, and the classification could serve to confirm or reject the hypotheses.
566. For an overview of seven important and popular families of classification algorithms, see Müller and Guido (2016, 31–121Müller, Andreas C., and Sarah Guido. 2016. Introduction to Machine Learning with Python: a Guide for Data Scientists. Sebastopol, CA: O’Reilly.). In the Python library scikit-learn, there are 17 groups of classifiers (Scikit-learn developers 2007–2023l).
567. For the three algorithms, the Python implementations in scikit-learn were used (Scikit-learn developers 2007–2023k, 2007–2023j, 2007–2023e).
568. In the cited paper, it is discussed in-depth which effect the choice of different distance measures has on the results of authorship attribution tasks. Such a systematic investigation of the role that different distance measures have on genre classification has not been conducted yet.
569. In addition to the feature weights, an intercept (y-axis offset) is learned for the linear function.
570. This means that the minimum value for every feature is 0, and the maximum value is 1. The same kind of preprocessing of the data was used by Hettinger et al. (2016 Hettinger, Lena, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2016. “Classification of Literary Subgenres.” In DHd2016. Modellierung – Vernetzung – Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts. Universität Leipzig 7. bis 12. März 2016, 160–164. Duisburg: nisaba verlag. https://doi.org/10.5281/zenodo.4645368.). For the feature scaling, the function “scale_feature_sets()” in the script “classification.py” was used. See https://github.com/cligs/scripts-nh/blob/master/analysis/classification.py. The scaled feature sets are stored as new files in the same location as the original ones, and they have the additional term “MinMax” at the end of their filename. See https://github.com/cligs/data-nh/tree/master/analysis/features/mfw for the general features and https://github.com/cligs/data-nh/tree/master/analysis/features/topics for the topic features.
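The scaling described in footnote 570 can be sketched for a single feature column. This is not the code of “scale_feature_sets()”; the function `min_max_scale` is hypothetical, and mapping a constant column to zeros is an assumption.

```python
def min_max_scale(column):
    """Rescale the values of one feature so that the minimum is 0 and the maximum is 1."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # Assumption: a feature with a single constant value is mapped to 0.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


print(min_max_scale([2.0, 4.0, 6.0]))
```

Each feature is scaled separately, so the relative order of the values within a feature is preserved while all features end up in the same range.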
571. The results of Random Forests can vary depending on the different random states that are used. To make the classification results reproducible, here, the random_state parameter of the RF classifier is always set to 0.
572. The main methods used to perform the parameter study are the functions parameter_study() and evaluate_parameter_study(), which are defined in the Python script classification.py available at https://github.com/cligs/scripts-nh/blob/master/analysis/classification.py. Accessed December 16, 2020. The results of all the grid searches for the three classifiers were stored together in the files grid-searches-KNN.csv, grid-searches-SVM.csv, and grid-searches-RF.csv, which can be viewed at https://github.com/cligs/data-nh/tree/main/analysis/classification/parameter_study. Accessed December 16, 2020.
573. The mean is calculated from the test scores of the ten splits from cross-validation.
574. In principle, it would also be possible to combine the two main types of feature sets. This was done, for example, by Hettinger et al. (2015 Hettinger, Lena, Martin Becker, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2015. “Genre Classification on German Novels.” In Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 249–253. Valencia. https://doi.org/10.1109/DEXA.2015.62.). In their approach to classifying German novels by subgenre, they experimented with combining feature sets to see if it improved their results. They used Principal Component Analysis (PCA) to make the sizes of the feature sets comparable and found out that some combinations improved the results, whereas others led to lower accuracy scores.
575. The implementation of the classification tasks is included in the script classification.py, which is available at https://github.com/cligs/scripts-nh/blob/master/analysis/classification.py. Accessed January 9, 2021. The repetition of the data selection process with undersampling for the bigger class is also implemented in that script. For the cross-validation, scikit-learn’s function cross_validate() was used. With that function, all the estimators that are trained for the different cross-validation runs can be returned. This was done, and the results of each cross-validation run were stored in CSV tables. The mean accuracy values and standard deviations, which are discussed in the result section, are calculated with a separate Python script based on the collected cross-validation runs in the CSV tables. It was necessary to get the results for every estimator to be able to analyze the feature importances and predictions of each one (Scikit-learn developers 2007–2023hScikit-learn developers. 2007–2023h. “sklearn.model_selection.cross_validate.” Scikit-learn. https://web.archive.org/web/20230304130816/https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html.).
576. Another, more sophisticated way to analyze this would be to check the probabilities that the classifiers calculate for an instance to belong to the different classes. The implementations of KNN, SVM, and RF in scikit-learn include the possibility of inspecting the probability calculations, but for the SVM it is noted that these values may be inconsistent with the general prediction (Scikit-learn developers 2007–2023mScikit-learn developers. 2007–2023m. “Support Vector Machines, sec. Scores and probabilities.” Scikit-learn. https://web.archive.org/web/20230304123130/https://scikit-learn.org/stable/modules/svm.html.).
577. Analyzing feature importances for the three types of classifiers involves different concepts and techniques. For SVMs, the coefficients of the linear model can be interpreted (Scikit-learn developers 2007–2023kScikit-learn developers. 2007–2023k. “sklearn.svm.SVC.” Scikit-learn. https://web.archive.org/web/20230304131430/https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.). For RFs, the trained estimators in scikit-learn include a list of feature importances which is based on the average contribution of each feature to reduce misclassifications in all the different trees of the forest (Scikit-learn developers 2007–2023eScikit-learn developers. 2007–2023e. “sklearn.ensemble.RandomForestClassifier.” Scikit-learn. https://web.archive.org/web/20230304130404/https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.). Finally, for KNN, no feature importances are returned because the classification is made based on similarities between whole data points, which means that all the features of the neighboring points are decisive. So to interpret how important individual features were in this case, other external methods have to be used. One possibility is to use the feature set which worked best with KNN and to calculate how distinctive different features are for the classes in question. For example, this can be done by evaluating which features differ most from the mean values of the features in both the positive and negative class and ranking these features by their distinctiveness. If the full-texts are used as a basis (instead of the specific feature set), zeta-scores can be calculated for the features to find the ones that are distinctive for the classes. For an overview of the zeta-measure of distinctiveness and its variants, see Schöch, Schlör et al. (2018Schöch, Christof, Daniel Schlör, Albin Zehe, Henning Gebhard, Martin Becker, and Andreas Hotho. 2018. “Burrows’ Zeta: Exploring and Evaluating Variants and Parameters.” In Digital Humanities 2018. Puentes–Bridges. Book of Abstracts. Mexico City, 26–29 June 2018, 274–278. Mexico City: Red de Humanidades Digitales. https://web.archive.org/web/20230212045250/https://dh2018.adho.org/burrows-zeta-exploring-and-evaluating-variants-and-parameters/.).
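One simple reading of the mean-based ranking for KNN described in footnote 577 is to compare the per-class mean of each feature and to sort the features by the absolute difference. The sketch below follows that simplified reading; the function name, the feature names, and the toy values are hypothetical and do not come from the study.

```python
from statistics import mean


def rank_by_mean_difference(positive, negative, feature_names):
    """Rank features by the absolute difference between their class means."""
    differences = {}
    for index, name in enumerate(feature_names):
        positive_mean = mean(row[index] for row in positive)
        negative_mean = mean(row[index] for row in negative)
        differences[name] = positive_mean - negative_mean
    # The largest absolute difference comes first; the sign shows the preferring class.
    return sorted(differences.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy feature values for two texts per class (invented numbers).
ranking = rank_by_mean_difference(
    positive=[[0.9, 0.5], [0.7, 0.5]],
    negative=[[0.1, 0.4], [0.3, 0.6]],
    feature_names=["feature_a", "feature_b"],
)
print(ranking)
```

Such a ranking does not describe the KNN decision itself; it only indicates which features separate the two classes most clearly in the feature set that was used.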
578. See the presentation of thematic subgenres from a literary-historical point of view in chapter 2.3.1.
579. The classification was performed with the script https://github.com/cligs/scripts-nh/blob/master/analysis/classification.py. All the results are available at https://github.com/cligs/data-nh/tree/main/analysis/classification/themes. Accessed January 3, 2021. Tables with the results of all the classification runs can be found in the subfolder “results_data”, summaries in tabular form in “results_summaries”, and visualizations of results in “visuals”.
580. For some classification runs, no F1 scores were available. This happens if the test set only contains instances of one class (either the positive or the negative one) because then it is not possible to calculate precision and recall values. The missing F1 scores were just left out for the calculations of the means and standard deviations. For the calculation of accuracy values, in turn, this does not constitute a problem. The values in the table are rounded to two decimal places.
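The handling of missing F1 scores described in footnote 580 can be sketched as follows. This is not the code of the evaluation script; the functions `f1_or_none` and `summarize` are hypothetical, and returning 0.0 when there are no true positives is an assumption.

```python
from statistics import mean, stdev


def f1_or_none(y_true, y_pred, positive=1):
    """Return the F1 score, or None if the test set contains only one class."""
    if len(set(y_true)) < 2:
        return None
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        return 0.0  # assumption: no correct positive prediction counts as 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def summarize(scores):
    """Mean and standard deviation over the available scores, leaving out missing ones."""
    available = [score for score in scores if score is not None]
    spread = stdev(available) if len(available) > 1 else 0.0
    return mean(available), spread


print(f1_or_none([1, 1, 1], [1, 0, 1]))  # only one class in the test set
print(summarize([0.5, None, 1.0]))       # the missing score is left out
```

Accuracy needs no such exception because it only counts correct predictions and is defined for any test set.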
581. Accuracies that are mentioned in the text are rounded to two decimal places. Because the range between the lowest and highest mean accuracies is so small, different third or fourth decimal places lead to differences in the plots.
582. See also Schöch, who analyzes the effects that different optimization intervals have on the resulting topic models of collections of literary texts and who notes: “if your goal is to identify topics typical of certain authors, periods, genres or some other reasonably large subset of your collection, then it may be better to optimize a bit less. In any case, it seems to me that it is quite possible to do too much or too little optimization for a given task” (Schöch 2016Schöch, Christof. 2016. “Topic Modeling with MALLET: Hyperparameter Optimization.” The Dragonfly’s Gaze. https://web.archive.org/web/20230316145457/https://dragonfly.hypotheses.org/1051.).
583. Word clouds, which visualize the 40 top words of each topic in the model, are available at https://github.com/cligs/data-nh/tree/main/analysis/features/topics/5_visuals/90tp-5000it-250in-0/wordles. The file containing all the top words for the topics can be found at https://github.com/cligs/data-nh/blob/main/analysis/features/topics/3_models/topics-with-words_90tp-5000it-250in-0.csv and the topic probabilities per novel at https://github.com/cligs/data-nh/blob/main/analysis/features/topics/4_aggregates/90tp-5000it-250in-0/avgtopicscores_by-idno.csv. Accessed January 6, 2021.
584. The sign of the feature weights is determined by the SVM classifier and is not directly related to the order of the classes as first and second.
585. In the novel “Alma de niña” (nh0082), for instance, it occurs 58 times, and in the novel “Auras de Abril” (nh0233), 20 times. The spell-checking result files for these novels show that the spell-checker did not recognize this word as an error. See https://github.com/cligs/conha19/blob/main/spellcheck/results/spellcheck_nh0082.csv and https://github.com/cligs/conha19/blob/main/spellcheck/results/spellcheck_nh0233.csv. A plot showing the top 20 novels for this topic is available at https://github.com/cligs/data-nh/blob/main/analysis/features/topics/5_visuals/90tp-5000it-250in-0/topItems/idno/tI_by-idno-028.png. Accessed January 7, 2021.
586. See the word clouds at https://github.com/cligs/data-nh/blob/main/analysis/features/topics/5_visuals/90tp-5000it-250in-0/wordles/wordle_tp089.png and https://github.com/cligs/data-nh/blob/main/analysis/features/topics/5_visuals/90tp-5000it-250in-0/wordles/wordle_tp035.png, respectively. Accessed January 7, 2020.
587. The three plots with the feature importances for novela histórica versus other novels, for novela sentimental versus other novels, and for novela de costumbres versus other novels can be seen at https://github.com/cligs/data-nh/tree/main/analysis/classification/themes/visuals and are called “feat_imp_SVM_90t_250oi_topic-rep_0_novela_histórica_other.html”, “feat_imp_SVM_90t_250oi_topic-rep_0_novela_sentimental_other.html”, and “feat_imp_SVM_90t_250oi_topic-rep_0_novela_de_costumbres_other.html”, respectively. Accessed January 7, 2020.
588. The charts about the classification results for individual novels are available in the “visuals” folder in the data-nh GitHub repository (see the previous footnote). The data was also collected in CSV files (called “misclassifications...”), which can be viewed in the folder https://github.com/cligs/data-nh/tree/main/analysis/classification/themes/results_summaries. Accessed January 7, 2020.
589. The probabilities are used in the form they have after the MinMax scaling, which sets the range of the individual features to [0,1]. This form is used here because it was the form of the features that were also used by the classifier.
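The MinMax scaling mentioned in footnote 589 can be illustrated with a small sketch. It is a plain-Python approximation and not the scaler actually used in the study; the function names are hypothetical, and mapping a constant feature to 0 is an assumption.

```python
def min_max_scale(column):
    """Rescale one feature (a list of numbers) to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # assumption: a feature without any variation is mapped to 0
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

def scale_features(matrix):
    """Scale every feature (column) of a row-wise matrix independently."""
    columns = list(zip(*matrix))
    scaled_columns = [min_max_scale(list(col)) for col in columns]
    # transpose back so that each row again represents one novel
    return [list(row) for row in zip(*scaled_columns)]
```

Because each feature is scaled on its own, the scaled topic probabilities of one novel no longer sum to 1; they only indicate how high a value is relative to the other novels.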
590. The publication date in the corpus differs from the one mentioned by Brushwood because it refers to the year of the first book publication.
591. At least when compared to the “other” group.
592. Remos y Rubio mentions 1914 as the publication date, but he refers to the second part of the novel (or the novel with both parts).
593. Here only the plots for the contrast of individual subgenres with the “other” group were shown. Another possibility is to analyze the direct subgenre comparisons to investigate how two selected subgenres relate to each other in terms of prototypicality.
594. The different numbers of runs are due to the different amounts of parameter constellations for each feature type. For basic MFW, there is just one token unit (word); for word n-grams, there are three; and for character n-grams, nine. In addition, all of the ten repetitions with the different data samples and ten cross-validation runs per constellation are included.
595. The scores and standard deviations are rounded to two decimal places.
596. The results for tf-scores are not visible here because they are almost identical to those for z-scores: the latter are based on the tf-scores, and the scaling of the features is not decisive for the RF classifier.
597. See chapter 4.2.1.1 above.
598. Hettinger et al. (2015 Hettinger, Lena, Martin Becker, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2015. “Genre Classification on German Novels.” In Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 249–253. Valencia. https://doi.org/10.1109/DEXA.2015.62.) also found that 3,000 MFW worked best for the classification of subgenres of the novel with SVM. This is astonishing because they classified nineteenth-century novels in German and not in Spanish. Here, the plot for SVM was also checked, and there the highest mean accuracies are at 3,000 and 4,000 MFW. As Hettinger et al. also analyzed mainly thematic subgenres (adventure novels, social novels, and education novels), it can be hypothesized that a range of 2,000 to 4,000 MFW is a good feature choice for the classification of thematic subgenres, independently of the language and classifier.
599. The default n-grams are provided by the CountVectorizer of scikit-learn. See chapter 4.2.1.1 on the MFW-based features for details.
600. Hettinger et al. (2015 Hettinger, Lena, Martin Becker, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2015. “Genre Classification on German Novels.” In Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 249–253. Valencia. https://doi.org/10.1109/DEXA.2015.62.) used the top 1,000 character 4-grams for the classification of German novels by subgenre.
601. See chapter 4.2.1.1 above, in which these special types of n-gram features are explained.
602. The scores and standard deviations are rounded to two decimal places.
603. For an overview of the characteristics of the romantic, realist, and naturalistic novels, see chapter 2.3.2.
604. This overview was generated with the function plot_overview_literary_currents_primary() in the script https://github.com/cligs/scripts-nh/blob/master/analysis/classification.py. The resulting chart is available at https://github.com/cligs/data-nh/blob/main/analysis/classification/literary-currents/visuals/overview-primary-currents-corp.html. Accessed October 29, 2020. For more extensive overviews of the novels by primary literary current, see chapter 4.1.5.3.2.
605. A future task for the literary currents is to use the classification models to determine the labels of the novels for which no indication of literary current could be found, either through explicit or implicit paratextual signals or via literary-historical publications.
606. The scores and standard deviations in the table are rounded to two decimal places.
607. Slight differences in the plot are due to the small range of the values. Here the scores are rounded to two decimal places.
608. The scores are rounded to two decimal places.
609. In the corpus, the novels “¿Inocentes o culpables?” (1884, AR) by Juan Antonio Argerich and “Los bandidos de Río Frío” (1892, MX) by Manuel Payno carry the label “novela naturalista” in their subtitles. On the discussion of whether to treat the naturalistic novel as a subgenre or a movement, see Schlickers (2003, 16Schlickers, Sabine. 2003. El lado oscuro de la modernización: estudios sobre la novela naturalista hispanoamericana. Madrid, Frankfurt: Iberoamericana/Vervuert.).
610. Cosine similarity measures the cosine of the angle between two text vectors. See https://en.wikipedia.org/wiki/Cosine_similarity. Accessed April 28, 2020.
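The measure named in footnote 610 can be written out in a few lines. The following is a minimal sketch in plain Python, not the implementation used for the analysis; the function name and the decision to raise an error for zero vectors are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # the angle is undefined if one of the vectors has length zero
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Two vectors that point in the same direction receive a value of 1 regardless of their length, so that texts with proportionally identical feature values count as maximally similar.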
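611. Because the nearest-neighbor relation depends on the perspective (the closest neighbor of one node does not necessarily have that node as its own closest neighbor), it was calculated for each node. If two nodes are mutually closest, the strength of the edge between them increases.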
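The procedure outlined in footnote 611 might be sketched as follows. This is a hypothetical reconstruction and not the script used for the analysis: the function name, the input format (a nested dictionary of similarities), the parameter `k`, and the choice to add 1 to the edge weight for each direction are all assumptions.

```python
from collections import Counter

def nearest_neighbor_edges(similarity, k=1):
    """Build weighted, undirected edges from per-node nearest neighbors.

    similarity: dict mapping each node to a dict {other_node: similarity}.
    Every node contributes 1 to the edge to each of its k most similar
    nodes, so mutually closest nodes end up with an edge weight of 2.
    """
    weights = Counter()
    for node, others in similarity.items():
        # rank the other nodes from most to least similar
        # (ties keep the order in which they appear in the dictionary)
        ranked = sorted(others, key=lambda other: others[other], reverse=True)
        for neighbor in ranked[:k]:
            edge = tuple(sorted((node, neighbor)))  # undirected edge key
            weights[edge] += 1
    return dict(weights)
```

With the similarities a–b = 0.9, a–c = 0.2, and b–c = 0.5, for instance, the nodes a and b choose each other, which yields an edge of weight 2, whereas c chooses b without being chosen in return, which yields an edge of weight 1.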
612. For general information about community structures in networks, see https://en.wikipedia.org/wiki/Community_structure. Accessed April 28, 2020.
613. For a review of different community detection algorithms for networks, see Javed et al. (2018Javed, Muhammad Aqib, Muhammad Shahzad Younis, Siddique Latif, Junaid Qadir, and Adeel Baig. 2018. “Community detection in networks: A multidisciplinary review.” Journal of Network and Computer Applications 108: 87–111. https://doi.org/10.1016/j.jnca.2018.02.011.).
614. See https://github.com/taynaud/python-louvain for the Python module implementing the Louvain community detection. Accessed April 28, 2020. The algorithm itself is presented in Blondel et al. (2008Blondel, Vincent D., Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast unfolding of communities in large networks.” Journal of Statistical Mechanics: Theory and Experiment 10: 155–168. https://doi.org/10.1088/1742-5468/2008/10/P10008.).
615. Of course, zero values are also possible in the feature matrices and could be interpreted as absent, but it would not be proportionate to consider all values that are greater than zero as present. A possibility to model the features in a different way would be to define a threshold value and convert all values below it to zero and all values above it to 1 to get a binary distinction. Good reasons would have to be given for the value at which to set the threshold.
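The alternative modeling mentioned in footnote 615 could look like the following sketch. It was not part of the analysis; the function name is hypothetical, and counting values that are exactly equal to the threshold as present is an assumption because the footnote only speaks of values below and above it.

```python
def binarize(matrix, threshold):
    """Convert a feature matrix into a binary present/absent matrix."""
    # assumption: a value equal to the threshold counts as present (1)
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]
```

The result depends entirely on the chosen threshold, which is why the value would need to be justified.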
616. From the many decisions taken so far to formalize the family resemblance concept, it becomes clear that variants of this approach are possible. For example, the similarity measure used, the number of nearest neighbors considered, the way to determine the strength of the edges, and the kind of community detection algorithm could be varied. As is generally the case with feature-based categorization, the selection of the features and their modeling and parametrization are also subject to choice. Further empirical studies and serial analyses are needed to test the effects that such variation has on the results.
617. The metadata of the subcorpus used for this analysis is available as “metadata.csv” in the folder “corpus_metadata” at https://github.com/cligs/data-nh/tree/main/analysis/family-resemblance/. All other data related to this family resemblance analysis, including results and figures, can be found in the same GitHub folder. The Python scripts used are available at https://github.com/cligs/scripts-nh/tree/master/analysis/family_resemblance. Accessed January 8, 2021.
618. This topic model was created separately from the ones used for the classification in the previous chapter because the work on the family resemblance network was done earlier. The topic model used here was built with the tool MALLET (McCallum 2002McCallum, Andrew. 2002. “MALLET: A Machine Learning for Language Toolkit.” Accessed November 13, 2020. http://mallet.cs.umass.edu.) and pre- and post-processed with tmw (Schöch and Schlör 2017Schöch, Christof, and Daniel Schlör. 2017. “tmw – Topic Modeling Workflow.” GitHub. Accessed November 14, 2020. https://github.com/cligs/tmw.). The texts were lemmatized with the TreeTagger (Schmid 1994Schmid, Helmut. 1994. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 44–49. https://web.archive.org/web/20230603115230/https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf.), using the Spanish parameter file, and only nouns were kept. In addition, a list of stop words was prepared based on the 50 most frequent nouns and adapted manually. To this, some more stop words were added after inspecting the results of the topic model (e.g., proper names or very general nouns). Before running the topic modeling, the texts were first lemmatized and then segmented into chunks with a length of 1,000 tokens. Besides the number of topics, the topic model was created with 5,000 iterations. The feature matrices, both for the full and the reduced corpus, can be viewed on GitHub (see the previous footnote).
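The segmentation step described in footnote 618 can be illustrated with a short sketch. It does not reproduce the tmw code; the function name is hypothetical, and keeping a shorter final chunk instead of discarding it or merging it with the previous one is an assumption.

```python
def segment_into_chunks(tokens, chunk_size=1000):
    """Split a list of (lemmatized) tokens into consecutive chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # assumption: a final chunk shorter than chunk_size is kept as it is
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
```

A text of 2,500 tokens would thus be split into two chunks of 1,000 tokens and one of 500 tokens.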
619. The script calling the various functions of the network analysis for the different setups is available at https://github.com/hennyu/papers/blob/master/family_resemblance_dsrom19/analysis/run_scripts.py. Accessed May 6, 2020.
620. The overall results can be inspected on GitHub, though.
621. In a strict sense, the topics are categories and not numerical values and should be visualized as bars rather than lines. The line plot was chosen here because it facilitates seeing the differences between the data series.
622. This website is not available anymore.
623. This website has broken links and does not seem to be supported anymore.
624. This website is not available anymore.
625. The digital repository of the Mexican government’s Secretary of Culture was relaunched during the preparation of this dissertation. The two novels in the corpus were taken from the old version of the repository and could be downloaded at http://dgb.conaculta.gob.mx/coleccion_sep and http://impresosmexicanos.conaculta.gob.mx/libros. These links are not accessible anymore. The new version of this digital repository whose link is given in the table is called “Mexicana. Repositorio del Patrimonio Cultural de México”.