Institut für Dokumentologie und Editorik

Genre Analysis and Corpus Design: Nineteenth-Century Spanish-American Novels (1830–1910)

 

Index of Figures

Figure 1: Relationships between text types, conventional genres, and textual genres.

Figure 2: Proportion of paragraphs containing direct speech, travelogues versus novels.

Figure 3: Number of words per page for a sample of 100 pages.

Figure 4: Number of words for the full texts of 129 works carrying the label “novela”.

Figure 5: Number of pages and words for the bibliographic entries of 252 works carrying the label “novela”.

Figure 6: Number of words for 381 works carrying the label “novela”.

Figure 7: Number of words for 65 works carrying the label “novela corta”.

Figure 8: Works by source. Left: candidates, right: entries in the bibliography.

Figure 9: Inclusion and reasons for exclusion of works.

Figure 10: Kinds of subgenres in the context of a discursive model.

Figure 11: Sources by institution.

Figure 12: Sources by file type and institution.

Figure 13: Sources by type of edition and type of institution.

Figure 14: Distribution of spelling errors without exception words.

Figure 15: Distribution of spelling errors without exception words (logarithmic scale).

Figure 16: Top 30 spelling errors.

Figure 17: Number of error tokens and types covered by exception lists.

Figure 18: Distribution of spelling errors with exception words.

Figure 19: Distribution of error tokens and types for the corpus files (absolute).

Figure 20: Distribution of error tokens and types for the corpus files (relative).

Figure 21: Distribution of error tokens and types for the corpus files (by type of source edition).

Figure 22: Distribution of error tokens and types for the corpus files (by source file type).

Figure 23: Distribution of error tokens and types for the corpus files (by source institution).

Figure 24: Death years of authors.

Figure 25: Years of the novels’ first publications.

Figure 26: Publication years of basis editions.

Figure 27: Copyright statuses of the novels in the corpus.

Figure 28: Characterization of the direct speech annotated in the corpus.

Figure 29: Pages with direct speech from “Libro extraño” by Francisco Sicardi, with initial speech signs (left page) and without speech signs (right page).

Figure 30: Scores for direct speech recognition (gold standard versus regular expression approach).

Figure 31: F1 scores for direct speech recognition by kind of edition.

Figure 32: F1 scores for direct speech recognition by type of speech sign.

Figure 33: Verb forms with enclitic pronouns in the novels of the corpus.

Figure 34: FreeLing POS of verb forms with enclitic pronouns.

Figure 35: FreeLing POS of verb forms with enclitic pronouns in the texts of the corpus.

Figure 36: Proportions of zero values in MFW feature sets.

Figure 37: Distribution of zero values in MFW100.

Figure 38: Distribution of zero values in MFW5000.

Figure 39: Variances of the 1000 MFW (absolute values).

Figure 40: Variances of the 1000 MFW (tf-scores).

Figure 41: Variances of the 1000 MFW (tf-idf-scores).

Figure 42: Variances of the 1000 MFW (z-scores).

Figure 43: Mean coherence of the topic models with different parameter settings.

Figure 44: Example topics.

Figure 45: Frequency of rank 1 for different values of n_neighbors (KNN).

Figure 46: Frequency of rank 1 for different values of weights (KNN).

Figure 47: Frequency of rank 1 for different values of metric (KNN).

Figure 48: Frequency of rank 1 for different values of C (SVM).

Figure 49: Frequency of rank 1 for different values of max_features (RF).

Figure 50: Classification workflow.

Figure 51: Primary thematic subgenres in the corpus.

Figure 52: Classification results for topic feature sets (SVM, varying number of topics and optimization intervals).

Figure 53: Feature weights (topics) for historical versus sentimental novels.

Figure 54: Most distinctive topics for historical versus sentimental novels.

Figure 55: Topics “v_d-instante-corazón” and “tía-do-aire”.

Figure 56: Feature weights (topics) for novels of customs versus historical novels.

Figure 57: Topics “mesa-puerta-sala” and “boca-cabeza-perro”.

Figure 58: Feature weights (topics) for novels of customs versus sentimental novels.

Figure 59: Predictions for novela histórica versus other novels (topics).

Figure 60: Top topics for novela histórica versus other novels in the novel “La cruz y la espada”.

Figure 61: Top topics for novela histórica versus other novels in the novel “Las gentes que son así”.

Figure 62: Top topics for novela histórica versus other novels in the novel “Los bandidos de Río Frío”.

Figure 63: Top topics for novela histórica versus other novels in the novel “Los esposos”.

Figure 64: Top topics for novela histórica versus other novels in the novel “Vía Crucis”.

Figure 65: Top topics for novela histórica versus other novels in the novel “Las ranas pidiendo rey”.

Figure 66: Predictions for novela sentimental versus other novels (topics).

Figure 67: Predictions for novela de costumbres versus other novels (topics).

Figure 68: Classification results for MFW feature sets (RF, varying number of MFW and normalization technique).

Figure 69: Classification results for word n-gram feature sets (RF, varying number of MFW, grams, and normalization technique).

Figure 70: Classification results for classic character n-gram feature sets (RF, varying number of MFW, grams, and normalization technique).

Figure 71: Classification results for “word” character n-gram feature sets (RF, varying number of MFW, grams, and normalization technique).

Figure 72: Classification results for “affix-punct” character n-gram feature sets (RF, varying number of MFW, grams, and normalization technique).

Figure 73: Primary literary currents in the corpus.

Figure 74: Classification results for topic feature sets (SVM, varying number of topics and optimization intervals).

Figure 75: Classification results for MFW feature sets (SVM, varying number of MFW and normalization technique).

Figure 76: Classification results for word n-gram feature sets (SVM, varying number of MFW, grams, and normalization technique).

Figure 77: Classification results for classic character n-gram feature sets (SVM, varying number of MFW, grams, and normalization technique).

Figure 78: Classification results for “word” character n-gram feature sets (SVM, varying number of MFW, grams, and normalization technique).

Figure 79: Classification results for “affix-punct” character n-gram feature sets (SVM, varying number of MFW, grams, and normalization technique).

Figure 80: Feature weights (MFW) for realist versus romantic novels.

Figure 81: Feature weights (MFW) for naturalistic versus realist novels.

Figure 82: Predictions for novela romántica versus other novels (MFW).

Figure 83: Predictions for novela realista versus other novels (MFW).

Figure 84: Predictions for novela naturalista versus other novels (MFW).

Figure 85: Subcorpus for the family resemblance analysis.

Figure 86: Examples of topics for the family resemblance analysis.

Figure 87: Network of historical novels based on topics (HIST).

Figure 88: Overview of cluster metadata in the network HIST.

Figure 89: Clusters by year in the network HIST.

Figure 90: Topic scores for cluster 3 in the network HIST.

Figure 91: Top distinctive topics in the clusters of the network HIST.

Figure 92: Clusters by year in the network SENT.

Figure 93: Clusters by subgenre in the combined network.

Figure 94: Number of works per author.

Figure 95: Number of editions per author.

Figure 96: Authors by country.

Figure 97: Authors by nationality.

Figure 98: Authors by country of birth.

Figure 99: Authors by country of death.

Figure 100: Author gender.

Figure 101: Knowledge of the authors’ life dates.

Figure 102: Births and deaths of authors by decade.

Figure 103: Authors alive per year.

Figure 104: Number of active authors per year.

Figure 105: Author ages when publishing novels.

Figure 106: Authors’ age at death.

Figure 107: Number of works per year in Bib-ACMé and Conha19.

Figure 108: Works by decade in Bib-ACMé and Conha19.

Figure 109: Works before and after 1880.

Figure 110: Works by decade and country.

Figure 111: Works by country in Bib-ACMé and Conha19.

Figure 112: Publication countries of first editions.

Figure 113: High and low prestige novels by country.

Figure 114: High and low prestige novels by decade.

Figure 115: High and low prestige novels before and in or after 1880.

Figure 116: Narrative perspective by country.

Figure 117: Narrative perspective by decade.

Figure 118: Narrative perspective before and in or after 1880.

Figure 119: Continent and country of the setting.

Figure 120: Continent of the setting by country.

Figure 121: Continent of the setting per decade.

Figure 122: Continent of the setting before and in or after 1880.

Figure 123: Time periods of the setting relative to the authors’ birth year and publication year.

Figure 124: Time periods of the setting by country.

Figure 125: Time period of the setting per decade.

Figure 126: Time period of the setting before and in or after 1880.

Figure 127: Length of the novels in the corpus.

Figure 128: Length of the novels by country.

Figure 129: Length of the novels per decade.

Figure 130: Number of editions per work in Bib-ACMé and Conha19.

Figure 131: Editions per year in Bib-ACMé and Conha19.

Figure 132: Editions per decade in Bib-ACMé and Conha19.

Figure 133: Editions before and in or after 1880.

Figure 134: Editions by country in Bib-ACMé and Conha19.

Figure 135: Editions by place of publication in Bib-ACMé and Conha19.

Figure 136: Works with the label “novela” by decade.

Figure 137: Top 20 most frequent explicit subgenre labels in the bibliography.

Figure 138: Top 20 most frequent explicit subgenre labels in the corpus.

Figure 139: Works with an “identity label” by decade.

Figure 140: Top 20 most frequent subgenre signals in the bibliography.

Figure 141: Top 20 most frequent subgenre signals in the corpus.

Figure 142: Top 20 most frequent literary historical subgenre labels in the bibliography.

Figure 143: Top 20 most frequent literary historical subgenre labels in the corpus.

Figure 144: Number of different subgenre labels on discursive levels (in Bib-ACMé).

Figure 145: Overall number of subgenre labels on discursive levels (in Bib-ACMé).

Figure 146: Thematic subgenre labels in Bib-ACMé and Conha19.

Figure 147: Sources of thematic subgenres in Bib-ACMé.

Figure 148: Number of thematic labels per work.

Figure 149: Primary thematic subgenres of the works.

Figure 150: Subgenre labels related to literary currents in Bib-ACMé and Conha19.

Figure 151: Sources of subgenre labels related to literary currents in Bib-ACMé.

Figure 152: Publication years of works by literary current in Bib-ACMé.

Figure 153: Subgenre labels related to the mode of representation in Bib-ACMé and Conha19.

Figure 154: Sources of labels related to the mode of representation in Bib-ACMé.

Figure 155: Subgenre labels related to the mode of reality in Bib-ACMé and Conha19.

Figure 156: Sources of subgenre labels related to the mode of reality in Bib-ACMé.

Figure 157: Subgenres related to the linguistic, geographical, and socio-cultural identity.

Figure 158: Sources of identity subgenre labels in Bib-ACMé.

Figure 159: Constellations of identity groups in Conha19.

Figure 160: Subgenre labels related to medial aspects in Bib-ACMé and Conha19.

Figure 161: Sources of the subgenre labels related to medial aspects in Bib-ACMé.

Figure 162: Subgenre labels related to the attitude in Bib-ACMé and Conha19.

Figure 163: Sources of subgenre labels related to the attitude in Bib-ACMé.

Figure 164: Subgenre labels related to the intention in Bib-ACMé and Conha19.

Figure 165: Sources of subgenre labels related to the intention in Bib-ACMé.

Figure 166: Number of works per subgenre label.

Figure 167: Primary thematic subgenres in Bib-ACMé and Conha19.

Figure 168: Primary thematic subgenre labels in Bib-ACMé and Conha19 by country.

Figure 169: Primary thematic subgenre labels in Bib-ACMé per decade.

Figure 170: Primary thematic subgenre labels in Conha19 per decade.

Figure 171: Primary thematic subgenres in Bib-ACMé before and in or after 1880.

Figure 172: Primary thematic subgenres in Conha19 before and in or after 1880.

Figure 173: Primary thematic subgenre labels in Conha19 by prestige.

Figure 174: Primary thematic subgenres in Conha19 by narrative perspective.

Figure 175: Primary thematic subgenres in Conha19 by continent of the setting.

Figure 176: Primary thematic subgenres in Conha19 by time period of the setting.

Figure 177: Work lengths in tokens by primary thematic subgenre in Conha19.

Figure 178: Primary subgenres related to literary currents in Bib-ACMé and Conha19.

Figure 179: Primary subgenre labels related to literary currents in Bib-ACMé and Conha19 by country.

Figure 180: Primary subgenre labels related to literary currents in Bib-ACMé by decade.

Figure 181: Primary subgenre labels related to literary currents in Conha19 by decade.

Figure 182: Primary subgenres related to literary currents in Bib-ACMé before and in or after 1880.

Figure 183: Primary subgenres related to literary currents in Conha19 before and in or after 1880.

Figure 184: Primary subgenre labels related to literary currents in Conha19 by prestige.

Figure 185: Primary subgenre labels related to literary currents in Conha19 by narrative perspective.

Figure 186: Primary subgenre labels related to literary currents in Conha19 by continent of the setting.

Figure 187: Primary subgenres related to literary currents in Conha19 by time period of the setting.

Figure 188: Work length in tokens by primary subgenres related to literary currents in Conha19.