Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI

Zhang, Liang; Lin, Jionghao; Sabatini, John; Borchers, Conrad; Weitekamp, Daniel; Cao, Meng; Hollander, John; Hu, Xiangen; Graesser, Arthur C.

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1109/TLT.2025.3526582
Scopus: eid_2-s2.0-85214515034
WOS: WOS:001425521300001

Appears in Collections:
- Faculty of Education: Journal/Magazine Articles

Title	Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI
Authors	Zhang, Liang; Lin, Jionghao; Sabatini, John; Borchers, Conrad; Weitekamp, Daniel; Cao, Meng; Hollander, John; Hu, Xiangen; Graesser, Arthur C.
Keywords	Data augmentation; data sparsity; generative artificial intelligence; intelligent tutoring system; learning performance data
Issue Date	2025
Citation	IEEE Transactions on Learning Technologies, 2025. DOI: http://dx.doi.org/10.1109/TLT.2025.3526582
Abstract	Learning performance data, such as correct or incorrect answers and problem-solving attempts in Intelligent Tutoring Systems (ITSs), facilitate the assessment of knowledge mastery and the delivery of effective instruction. However, these data tend to be highly sparse (80% to 90% missing observations) in most real-world applications. This data sparsity makes it difficult to use learner models to predict learners' future performance and to explore new hypotheses about learning. This article proposes a systematic framework for augmenting learning performance data to address data sparsity. First, learning performance data can be represented as a 3-Dimensional (3D) tensor with dimensions corresponding to learners, questions, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation in knowledge tracing tasks that predict missing performance values from real observations. Third, Generative Artificial Intelligence (GenAI) models, namely a Vanilla Generative Adversarial Network (GAN) and a Generative Pretrained Transformer (GPT, specifically GPT-4o), generate data tailored to individual clusters of learning performance. We tested this systematic framework on adult literacy datasets from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that (1) tensor factorization outperformed baseline knowledge tracing techniques in tracing and predicting learning performance, demonstrating higher fidelity in data imputation, and (2) the Vanilla GAN-based augmentation demonstrated greater overall stability across varying sample sizes, whereas GPT-4o-based augmentation exhibited higher variability, with occasional cases showing closer fidelity to the original data distribution. This framework facilitates the effective augmentation of learning performance data, enabling a controlled, cost-effective approach to evaluating and optimizing ITS instructional designs in both online and offline environments prior to deployment, and supporting advanced educational data mining and learning analytics.
Persistent Identifier	http://hdl.handle.net/10722/354419
ISI Accession Number ID	WOS:001425521300001
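
The first two steps described in the abstract (a learners × questions × attempts tensor, then factorization-based imputation of its missing cells) can be illustrated with a minimal sketch. It is not the article's implementation: the function names (`build_tensor`, `sparsity`, `cp_impute`), the toy records, the rank, and the use of a plain CP factorization fitted by gradient descent in NumPy are all assumptions made for illustration. The GAN and GPT-4o augmentation step is not shown.

```python
import numpy as np


def build_tensor(records, n_learners, n_questions, n_attempts):
    """Place (learner, question, attempt, score) records in a 3D array.

    Cells with no record stay NaN, which marks them as missing.
    """
    tensor = np.full((n_learners, n_questions, n_attempts), np.nan)
    for learner, question, attempt, score in records:
        tensor[learner, question, attempt] = score
    return tensor


def sparsity(tensor):
    """Fraction of cells that hold no observation."""
    return float(np.isnan(tensor).mean())


def cp_impute(tensor, rank=2, steps=3000, lr=0.02, reg=0.01, seed=0):
    """Fill missing cells with a rank-`rank` CP model (illustrative only).

    The factors are fitted by gradient descent on the squared error over
    observed cells; observed values are returned unchanged.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(tensor)
    observed = np.where(mask, tensor, 0.0)
    n_l, n_q, n_a = tensor.shape
    A = rng.uniform(0.2, 0.8, size=(n_l, rank))  # learner factors
    B = rng.uniform(0.2, 0.8, size=(n_q, rank))  # question factors
    C = rng.uniform(0.2, 0.8, size=(n_a, rank))  # attempt factors
    for _ in range(steps):
        pred = np.einsum("ir,jr,kr->ijk", A, B, C)
        err = mask * (pred - observed)  # error on observed cells only
        grad_A = np.einsum("ijk,jr,kr->ir", err, B, C) + reg * A
        grad_B = np.einsum("ijk,ir,kr->jr", err, A, C) + reg * B
        grad_C = np.einsum("ijk,ir,jr->kr", err, A, B) + reg * C
        A -= lr * grad_A
        B -= lr * grad_B
        C -= lr * grad_C
    pred = np.clip(np.einsum("ir,jr,kr->ijk", A, B, C), 0.0, 1.0)
    return np.where(mask, tensor, pred)


# Hypothetical toy data: 4 learners, 3 questions, 2 attempts, 6 observed cells.
toy_records = [
    (0, 0, 0, 1.0), (0, 1, 1, 0.0), (1, 0, 0, 0.0),
    (1, 2, 1, 1.0), (2, 1, 0, 1.0), (3, 2, 0, 0.0),
]
toy_tensor = build_tensor(toy_records, n_learners=4, n_questions=3, n_attempts=2)
toy_filled = cp_impute(toy_tensor)
print("missing fraction:", sparsity(toy_tensor))
print("imputed tensor shape:", toy_filled.shape)
```

Imputed values are clipped to [0, 1] so that they can be read as predicted probabilities of a correct answer; that reading is an assumption of this sketch, not a claim about the article's method.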

DC Field	Value	Language
dc.contributor.author	Zhang, Liang	-
dc.contributor.author	Lin, Jionghao	-
dc.contributor.author	Sabatini, John	-
dc.contributor.author	Borchers, Conrad	-
dc.contributor.author	Weitekamp, Daniel	-
dc.contributor.author	Cao, Meng	-
dc.contributor.author	Hollander, John	-
dc.contributor.author	Hu, Xiangen	-
dc.contributor.author	Graesser, Arthur C.	-
dc.date.accessioned	2025-02-07T08:48:29Z	-
dc.date.available	2025-02-07T08:48:29Z	-
dc.date.issued	2025	-
dc.identifier.citation	IEEE Transactions on Learning Technologies, 2025	-
dc.identifier.uri	http://hdl.handle.net/10722/354419	-
dc.description.abstract	Learning performance data, such as correct or incorrect answers and problem-solving attempts in Intelligent Tutoring Systems (ITSs), facilitate the assessment of knowledge mastery and the delivery of effective instruction. However, these data tend to be highly sparse (80% to 90% missing observations) in most real-world applications. This data sparsity makes it difficult to use learner models to predict learners' future performance and to explore new hypotheses about learning. This article proposes a systematic framework for augmenting learning performance data to address data sparsity. First, learning performance data can be represented as a 3-Dimensional (3D) tensor with dimensions corresponding to learners, questions, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation in knowledge tracing tasks that predict missing performance values from real observations. Third, Generative Artificial Intelligence (GenAI) models, namely a Vanilla Generative Adversarial Network (GAN) and a Generative Pretrained Transformer (GPT, specifically GPT-4o), generate data tailored to individual clusters of learning performance. We tested this systematic framework on adult literacy datasets from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that (1) tensor factorization outperformed baseline knowledge tracing techniques in tracing and predicting learning performance, demonstrating higher fidelity in data imputation, and (2) the Vanilla GAN-based augmentation demonstrated greater overall stability across varying sample sizes, whereas GPT-4o-based augmentation exhibited higher variability, with occasional cases showing closer fidelity to the original data distribution. This framework facilitates the effective augmentation of learning performance data, enabling a controlled, cost-effective approach to evaluating and optimizing ITS instructional designs in both online and offline environments prior to deployment, and supporting advanced educational data mining and learning analytics.	-
dc.language	eng	-
dc.relation.ispartof	IEEE Transactions on Learning Technologies	-
dc.subject	Data augmentation	-
dc.subject	data sparsity	-
dc.subject	generative artificial intelligence	-
dc.subject	intelligent tutoring system	-
dc.subject	learning performance data	-
dc.title	Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI	-
dc.type	Article	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1109/TLT.2025.3526582	-
dc.identifier.scopus	eid_2-s2.0-85214515034	-
dc.identifier.eissn	1939-1382	-
dc.identifier.isi	WOS:001425521300001	-
