TY - JOUR
T1 - Harnessing Textbooks for High-Quality Labeled Data
T2 - 5th International Workshop on Intelligent Textbooks, iTextbooks 2023
AU - Pozzi, Lorenzo
AU - Alpizar-Chacon, Isaac
AU - Sosnovsky, Sergey
N1 - Publisher Copyright:
© 2023 Copyright for this paper by its authors.
PY - 2023
Y1 - 2023
AB - As textbooks evolve into digital platforms, they open a world of opportunities for Artificial Intelligence in Education (AIED) research. This paper delves into the novel use of textbooks as a source of high-quality labeled data for automatic keyword extraction, demonstrating an affordable and efficient alternative to traditional methods. By utilizing the wealth of structured information provided in textbooks, we propose a methodology for annotating corpora across diverse domains, circumventing the costly and time-consuming process of manual data annotation. Our research presents a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) fine-tuned on this newly labeled dataset. This model is applied to keyword extraction tasks, with the model’s performance surpassing established baselines. We further analyze the transformation of BERT’s embedding space before and after the fine-tuning phase, illuminating how the model adapts to specific domain goals. Our findings substantiate textbooks as a resource-rich, untapped well of high-quality labeled data, underpinning their significant role in the AIED research landscape.
KW - automatic keyword extraction
KW - BERT fine-tuning
KW - labeled data
KW - textbooks
UR - http://www.scopus.com/inward/record.url?scp=85168995452&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85168995452
SN - 1613-0073
VL - 3444
SP - 66
EP - 77
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
Y2 - 3 July 2023
ER -