• Open supplemental data
  • Reference Manager
  • Simple TEXT file

People also looked at

Original research article, explainable automated essay scoring: deep learning really has pedagogical value.

student essay dataset

  • School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Edmonton, AB, Canada

Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize’s essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. It has been found that the effect of deep learning can be best viewed when assessing the trustworthiness of explanation models. As more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower model-agnostic one. It leverages the state-of-the-art in natural language processing, applying feature selection on a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. In addition to the list of most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that allows to quantify the impact of the implementation of formative feedback.

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA) for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; only a few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores ( Kumar et al., 2017 ; Taghipour, 2017 ; Kumar and Boulanger, 2020 ). None has attempted to explain the whole decision process of AES, from holistic scores to rubric scores and from rubric scores to writing feature modeling. Although several algorithms from XAI (explainable artificial intelligence) ( Adadi and Berrada, 2018 ; Murdoch et al., 2019 ) have recently been published (e.g., LIME, SHAP) ( Ribeiro et al., 2016 ; Lundberg and Lee, 2017 ), no research has yet investigated the role that these explanation models (trained on top of predictive models) can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing teachers and students with personalized, formative, and fine-grained feedback during the writing process.

One of the key anticipated benefits of AES is the elimination of human bias such as rater fatigue, rater’s expertise, severity/leniency, scale shrinkage, stereotyping, Halo effect, rater drift, perception difference, and inconsistency ( Taghipour, 2017 ). At its turn, AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and consequently trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task ( Madnani et al., 2017 ; Madnani and Cahill, 2018 ). Hence, researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way such as curriculum alignment, construction of training corpora, reliable scoring process, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings ( Rupp, 2018 ; West-Smith et al., 2018 ; Rupp et al., 2019 ). Unfortunately, although these measures are intended to design reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.

It has been previously recognized that divergence of opinion among human and machine graders has been only investigated superficially ( Reinertsen, 2018 ). So far, researchers investigated the characteristics of essays through qualitative analyses which ended up rejected by AES systems (requiring a human to score them) ( Reinertsen, 2018 ). Others strived to justify predicted scores by identifying essay segments that actually caused the predicted scores. In spite of the fact that these justifications hinted at and quantified the importance of these spatial cues, they did not provide any feedback as to how to improve those suboptimal essay segments ( Mizumoto et al., 2019 ).

Related to this study and the work of Kumar and Boulanger (2020) is Revision Assistant, a commercial AES system developed by Turnitin ( Woods et al., 2017 ; West-Smith et al., 2018 ), which in addition to predicting essays’ holistic scores provides formative, rubric-specific, and sentence-level feedback over multiple drafts of a student’s essay. The implementation of Revision Assistant moved away from the traditional approach to AES, which consists in using a limited set of features engineered by human experts representing only high-level characteristics of essays. Like this study, it rather opted for including a large number of low-level writing features, demonstrating that expert-designed features are not required to produce interpretable predictions. Revision Assistant’s performance was reported on two essay datasets, one of which was the Automated Student Assessment Prize (ASAP) 1 dataset. However, performance on the ASAP dataset was reported in terms of quadratic weighted kappa and this for holistic scores only. Models predicting rubric scores were trained only with the other dataset which was hosted on and collected through Revision Assistant itself.

In contrast to feature-based approaches like the one adopted by Revision Assistant, other AES systems are implemented using deep neural networks where features are learned during model training. For example, Taghipour (2017) in his doctoral dissertation leverages a recurrent neural network to improve accuracy in predicting holistic scores, implement rubric scoring (i.e., organization and argument strength), and distinguish between human-written and computer-generated essays. Interestingly, Taghipour compared the performance of his AES system against other AES systems using the ASAP corpora, but he did not use the ASAP corpora when it came to train rubric scoring models although ASAP provides two corpora provisioning rubric scores (#7 and #8). Finally, research was also undertaken to assess the generalizability of rubric-based models by performing experiments across various datasets. It was found that the predictive power of such rubric-based models was related to how much the underlying feature set covered a rubric’s criteria ( Rahimi et al., 2017 ).

Despite their numbers, rubrics (e.g., organization, prompt adherence, argument strength, essay length, conventions, word choices, readability, coherence, sentence fluency, style, audience, ideas) are usually investigated in isolation and not as a whole, with the exception of Revision Assistant which provides feedback at the same time on the following five rubrics: claim, development, audience, cohesion, and conventions. The literature reveals that rubric-specific automated feedback includes numerical rubric scores as well as recommendations on how to improve essay quality and correct errors ( Taghipour, 2017 ). Again, except for Revision Assistant which undertook a holistic approach to AES including holistic and rubric scoring and provision of rubric-specific feedback at the sentence level, AES has generally not been investigated as a whole or as an end-to-end product. Hence, the AES used in this study and developed by Kumar and Boulanger (2020) is unique in that it uses both deep learning (multi-layer perceptron neural network) and a huge pool of linguistic indices (1592), predicts both holistic and rubric scores, explaining holistic scores in terms of rubric scores, and reports which linguistic indices are the most important by rubric. This study, however, goes one step further and showcases how to explain the decision process behind the prediction of a rubric score for a specific essay, one of the main AES limitations identified in the literature ( Taghipour, 2017 ) that this research intends to address, at least partially.

Besides providing explanations of predictions both globally and individually, this study not only goes one step further toward the automated provision of formative feedback but also does so in alignment with the explanation model and the predictive model, allowing to better map feedback to the actual characteristics of an essay. Woods et al. (2017) succeeded in associating sentence-level expert-derived feedback with strong/weak sentences having the greatest influence on a rubric score based on the rubric, essay score, and the sentence characteristics. While Revision Assistant’s feature space consists of counts and binary occurrence indicators of word unigrams, bigrams and trigrams, character four-grams, and part-of-speech bigrams and trigrams, they are mainly textual and locational indices; by nature they are not descriptive or self-explanative. This research fills this gap by proposing feedback based on a set of linguistic indices that can encompass several sentences at a time. However, the proposed approach omits locational hints, leaving the merging of the two approaches as the next step to be addressed by the research community.

Although this paper proposes to extend the automated provision of formative feedback through an interpretable machine learning method, it rather focuses on the feasibility of automating it in the context of AES instead of evaluating the pedagogical quality (such as the informational and communicational value of feedback messages) or impact on students’ writing performance, a topic that will be kept for an upcoming study. Having an AES system that is capable of delivering real-time formative feedback sets the stage to investigate (1) when feedback is effective, (2) the types of feedback that are effective, and (3) whether there exist different kinds of behaviors in terms of seeking and using feedback ( Goldin et al., 2017 ). Finally, this paper omits describing the mapping between the AES model’s linguistic indices and a pedagogical language that is easily understandable by students and teachers, which is beyond its scope.

Methodology

This study showcases the application of the PDR framework ( Murdoch et al., 2019 ), which provides three pillars to describe interpretations in the context of the data science life cycle: P redictive accuracy, D escriptive accuracy, and R elevancy to human audience(s). It is important to note that in a broader sense both terms “explainable artificial intelligence” and “interpretable machine learning” can be used interchangeably with the following meaning ( Murdoch et al., 2019 ): “the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data.” Here “predictive accuracy” refers to the measurement of a model’s ability to fit data; “descriptive accuracy” is the degree at which the relationships learned by a machine learning model can be objectively captured; and “relevant knowledge” implies that a particular audience gets insights into a chosen domain problem that guide its communication, actions, and discovery ( Murdoch et al., 2019 ).

In the context of this article, formative feedback that assesses students’ writing skills and prescribes remedial writing strategies is the relevant knowledge sought for, whose effectiveness on students’ writing performance will be validated in an upcoming study. However, the current study puts forward the tools and evaluates the feasibility to offer this real-time formative feedback. It also measures the predictive and descriptive accuracies of AES and explanation models, two key components to generate trustworthy interpretations ( Murdoch et al., 2019 ). Naturally, the provision of formative feedback is dependent on the speed of training and evaluating new explanation models every time a new essay is ingested by the AES system. That is why this paper investigates the potential of various SHAP implementations for speed optimization without compromising the predictive and descriptive accuracies. This article will show how the insights generated by the explanation model can serve to debug the predictive model and contribute to enhance the feature selection and/or engineering process ( Murdoch et al., 2019 ), laying the foundation for the provision of actionable and impactful pieces of knowledge to educational audiences, whose relevancy will be judged by the human stakeholders and estimated by the magnitude of resulting changes.

Figure 1 overviews all the elements and steps encompassed by the AES system in this study. The following subsections will address each facet of the overall methodology, from hyperparameter optimization to relevancy to both students and teachers.

www.frontiersin.org

Figure 1. A flow chart exhibiting the sequence of activities to develop an end-to-end AES system and how the various elements work together to produce relevant knowledge to the intended stakeholders.

Automated Essay Scoring System, Dataset, and Feature Selection

As previously mentioned, this paper reuses the AES system developed by Kumar and Boulanger (2020) . The AES models were trained using the ASAP’s seventh essay corpus. These narrative essays were written by Grade-7 students in the setting of state-wide assessments in the United States and had an average length of 171 words. Students were asked to write a story about patience. Kumar and Boulanger’s work consisted in training a predictive model for each of the four rubrics according to which essays were graded: ideas, organization, style, and conventions. Each essay was scored by two human raters on a 0−3 scale (integer scale). Rubric scores were resolved by adding the rubric scores assigned by the two human raters, producing a resolved rubric score between 0 and 6. This paper is a continuation of Boulanger and Kumar (2018 , 2019 , 2020) and Kumar and Boulanger (2020) where the objective is to open the AES black box to explain the holistic and rubric scores that it predicts. Essentially, the holistic score ( Boulanger and Kumar, 2018 , 2019 ) is determined and justified through its four rubrics. Rubric scores, in turn, are investigated to highlight the writing features that play an important role within each rubric ( Kumar and Boulanger, 2020 ). Finally, beyond global feature importance, it is not only indispensable to identify which writing indices are important for a particular essay (local), but also to discover how they contribute to increase or decrease the predicted rubric score, and which feature values are more/less desirable ( Boulanger and Kumar, 2020 ). This paper is a continuation of these previous works by adding the following link to the AES chain: holistic score, rubric scores, feature importance, explanations, and formative feedback. The objective is to highlight the means for transparent and trustable AES while empowering learning analytics practitioners with the tools to debug these models and equip educational stakeholders with an AI companion that will semi-autonomously generate formative feedback to teachers and students. Specifically, this paper analyzes the AES reasoning underlying its assessment of the “style” rubric, which looks for command of language, including effective and compelling word choice and varied sentence structure, that clearly supports the writer’s purpose and audience.

This research’s approach to AES leverages a feature-based multi-layer perceptron (MLP) deep neural network to predict rubric scores. The AES system is fed by 1592 linguistic indices quantitatively measured by the Suite of Automatic Linguistic Analysis Tools 2 (SALAT), which assess aspects of grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity ( Kumar and Boulanger, 2020 ). The purpose of using such a huge pool of low-level writing features is to let deep learning extract the most important ones; the literature supports this practice since there is evidence that features automatically selected are not less interpretable than those engineered ( Woods et al., 2017 ). However, to facilitate this process, this study opted for a semi-automatic strategy that consisted of both filter and embedded methods. Firstly, the original ASAP’s seventh essay dataset consists of a training set of 1567 essays and a validation and testing sets of 894 essays combined. While the texts of all 2461 essays are still available to the public, only the labels (the rubric scores of two human raters) of the training set have been shared with the public. Yet, this paper reused the unlabeled 894 essays of the validation and testing sets for feature selection, a process that must be carefully carried out by avoiding being informed by essays that will train the predictive model. Secondly, feature data were normalized, and features with variances lower than 0.01 were pruned. Thirdly, the last feature of any pair of features having an absolute Pearson correlation coefficient greater than 0.7 was also pruned (the one that comes last in terms of the column ordering in the datasets). After the application of these filter methods, the number of features was reduced from 1592 to 282. Finally, the Lasso and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied during the training of the rubric scoring models. Lasso is responsible for pruning further features, while Ridge regression is entrusted with eliminating multicollinearity among features.

Hyperparameter Optimization and Training

To ensure a fair evaluation of the potential of deep learning, it is of utmost importance to minimally describe this study’s exploration of the hyperparameter space, a step that is often found to be missing when reporting the outcomes of AES models’ performance ( Kumar and Boulanger, 2020 ). First, a study should list the hyperparameters it is going to investigate by testing for various values of each hyperparameter. For example, Table 1 lists all hyperparameters explored in this study. Note that L 1 and L 2 are two regularization hyperparameters contributing to feature selection. Second, each study should also report the range of values of each hyperparameter. Finally, the strategy to explore the selected hyperparameter subspace should be clearly defined. For instance, given the availability of high-performance computing resources and the time/cost of training AES models, one might favor performing a grid (a systematic testing of all combinations of hyperparameters and hyperparameter values within a subspace) or a random search (randomly selecting a hyperparameter value from a range of values per hyperparameter) or both by first applying random search to identify a good starting candidate and then grid search to test all possible combinations in the vicinity of the starting candidate’s subspace. Of particular interest to this study is the neural network itself, that is, how many hidden layers should a neural network have and how many neurons should compose each hidden layer and the neural network as a whole. These two variables are directly related to the size of the neural network, with the number of hidden layers being a defining trait of deep learning. A vast swath of literature is silent about the application of interpretable machine learning in AES and even more about measuring its descriptive accuracy, the two components of trustworthiness. Hence, this study pioneers the comprehensive assessment of deep learning impact on AES’s predictive and descriptive accuracies.

www.frontiersin.org

Table 1. Hyperparameter subspace investigated in this article along with best hyperparameter values per neural network architecture.

Consequently, the 1567 labeled essays were divided into a training set (80%) and a testing set (20%). No validation set was put aside; 5-fold cross-validation was rather used for hyperparameter optimization. Table 1 delineates the hyperparameter subspace from which 800 different combinations of hyperparameter values were randomly selected out of a subspace of 86,248,800 possible combinations. Since this research proposes to investigate the potential of deep learning to predict rubric scores, several architectures consisting of 2 to 6 hidden layers and ranging from 9,156 to 119,312 parameters were tested. Table 1 shows the best hyperparameter values per depth of neural networks.

Again, the essays of the testing set were never used during the training and cross-validation processes. In order to retrieve the best predictive models during training, every time the validation loss reached a record low, the model was overwritten. Training stopped when no new record low was reached during 100 epochs. Moreover, to avoid reporting the performance of overfit models, each model was trained five times using the same set of best hyperparameter values. Finally, for each resulting predictive model, a corresponding ensemble model (bagging) was also obtained out of the five models trained during cross-validation.

Predictive Models and Predictive Accuracy

Table 2 delineates the performance of predictive models trained previously by Kumar and Boulanger (2020) on the four scoring rubrics. The first row lists the agreement levels between the resolved and predicted rubric scores measured by the quadratic weighted kappa. The second row is the percentage of accurate predictions; the third row reports the percentages of predictions that are either accurate or off by 1; and the fourth row reports the percentages of predictions that are either accurate or at most off by 2. Prediction of holistic scores is done merely by adding up all rubric scores. Since the scale of rubric scores is 0−6 for every rubric, then the scale of holistic scores is 0−24.

www.frontiersin.org

Table 2. Rubric scoring models’ performance on testing set.

While each of these rubric scoring models might suffer from its own systemic bias and hence cancel off each other’s bias by adding up the rubric scores to derive the holistic score, this study (unlike related works) intends to highlight these biases by exposing the decision making process underlying the prediction of rubric scores. Although this paper exclusively focuses on the Style rubric, the methodology put forward to analyze the local and global importance of writing indices and their context-specific contributions to predicted rubric scores is applicable to every rubric and allows to control for these biases one rubric at a time. Comparing and contrasting the role that a specific writing index plays within each rubric context deserves its own investigation, which has been partly addressed in the study led by Kumar and Boulanger (2020) . Moreover, this paper underscores the necessity to measure the predictive accuracy of rubric-based holistic scoring using additional metrics to account for these rubric-specific biases. For example, there exist several combinations of rubric scores to obtain a holistic score of 16 (e.g., 4-4-4-4 vs. 4-3-4-5 vs. 3-5-2-6). Even though the predicted holistic score might be accurate, the rubric scores could all be inaccurate. Similarity or distance metrics (e.g., Manhattan and Euclidean) should then be used to describe the authenticity of the composition of these holistic scores.

According to what Kumar and Boulanger (2020) report on the performance of several state-of-the-art AES systems trained on ASAP’s seventh essay dataset, the AES system they developed and which will be reused in this paper proved competitive while being fully and deeply interpretable, which no other AES system does. They also supply further information about the study setting, essay datasets, rubrics, features, natural language processing (NLP) tools, model training, and evaluation against human performance. Again, this paper showcases the application of explainable artificial intelligence in automated essay scoring by focusing on the decision process of the Rubric #3 (Style) scoring model. Remember that the same methodology is applicable to each rubric.

Explanation Model: SHAP

SH apley A dditive ex P lanations (SHAP) is a theoretically justified XAI framework that can provide simultaneously both local and global explanations ( Molnar, 2020 ); that is, SHAP is able to explain individual predictions taking into account the uniqueness of each prediction, while highlighting the global factors influencing the overall performance of a predictive model. SHAP is of keen interest because it unifies all algorithms of the class of additive feature attribution methods, adhering to a set of three properties that are desirable in interpretable machine learning: local accuracy, missingness, and consistency ( Lundberg and Lee, 2017 ). A key advantage of SHAP is that feature contributions are all expressed in terms of the outcome variable (e.g., rubric scores), providing a same scale to compare the importance of each feature against each other. Local accuracy refers to the fact that no matter the explanation model, the sum of all feature contributions is always equal to the prediction explained by these features. The missingness property implies that the prediction is never explained by unmeasured factors, which are always assigned a contribution of zero. However, the converse is not true; a contribution of zero does not imply an unobserved factor, it can also denote a feature irrelevant to explain the prediction. The consistency property guarantees that a more important feature will always have a greater magnitude than a less important one, no matter how many other features are included in the explanation model. SHAP proves superior to other additive attribution methods such as LIME (Local Interpretable Model-Agnostic Explanations), Shapley values, and DeepLIFT in that they never comply with all three properties, while SHAP does ( Lundberg and Lee, 2017 ). Moreover, the way SHAP assesses the importance of a feature differs from permutation importance methods (e.g., ELI5), measured as the decrease in model performance (accuracy) as a feature is perturbated, in that it is based on how much a feature contributes to every prediction.

Essentially, a SHAP explanation model (linear regression) is trained on top of a predictive model, which in this case is a complex ensemble deep learning model. Table 3 demonstrates a scale explanation model showing how SHAP values (feature contributions) work. In this example, there are five instances and five features describing each instance (in the context of this paper, an instance is an essay). Predictions are listed in the second to last column, and the base value is the mean of all predictions. The base value constitutes the reference point according to which predictions are explained; in other words, reasons are given to justify the discrepancy between the individual prediction and the mean prediction (the base value). Notice that the table does not contain the actual feature values; these are SHAP values that quantify the contribution of each feature to the predicted score. For example, the prediction of Instance 1 is 2.46, while the base value is 3.76. Adding up the feature contributions of Instance 1 to the base value produces the predicted score:

www.frontiersin.org

Table 3. Array of SHAP values: local and global importance of features and feature coverage per instance.

Hence, the generic equation of the explanation model ( Lundberg and Lee, 2017 ) is:

where g(x) is the prediction of an individual instance x, σ 0 is the base value, σ i is the feature contribution of feature x i , x i ∈ {0,1} denotes whether feature x i is part of the individual explanation, and j is the total number of features. Furthermore, the global importance of a feature is calculated by adding up the absolute values of its corresponding SHAP values over all instances, where n is the total number of instances and σ i ( j ) is the feature contribution for instance i ( Lundberg et al., 2018 ):

Therefore, it can be seen that Feature 3 is the most globally important feature, while Feature 2 is the least important one. Similarly, Feature 5 is Instance 3’s most important feature at the local level, while Feature 2 is the least locally important. The reader should also note that a feature shall not necessarily be assigned any contribution; some of them are just not part of the explanation such as Feature 2 and Feature 3 in Instance 2. These concepts lay the foundation for the explainable AES system presented in this paper. Just imagine that each instance (essay) will be rather summarized by 282 features and that the explanations of all the testing set’s 314 essays will be provided.

Several implementations of SHAP exist: KernelSHAP, DeepSHAP, GradientSHAP, and TreeSHAP, among others. KernelSHAP is model-agnostic and works for any type of predictive models; however, KernelSHAP is very computing-intensive which makes it undesirable for practical purposes. DeepSHAP and GradientSHAP are two implementations intended for deep learning which takes advantage of the known properties of neural networks (i.e., MLP-NN, CNN, or RNN) to accelerate up to three orders of magnitude the processing time to explain predictions ( Chen et al., 2019 ). Finally, TreeSHAP is the most powerful implementation intended for tree-based models. TreeSHAP is not only fast; it is also accurate. While the three former implementations estimate SHAP values, TreeSHAP computes them exactly. Moreover, TreeSHAP not only measures the contribution of individual features, but it also considers interactions between pairs of features and assigns them SHAP values. Since one of the goals of this paper is to assess the potential of deep learning on the performance of both predictive and explanation models, this research tested the former three implementations. TreeSHAP is recommended for future work since the interaction among features is critical information to consider. Moreover, KernelSHAP, DeepSHAP, and GradientSHAP all require access to the whole original dataset to derive the explanation of a new instance, another constraint TreeSHAP is not subject to.

Descriptive Accuracy: Trustworthiness of Explanation Models

This paper reuses and adapts the methodology introduced by Ribeiro et al. (2016) . Several explanation models will be trained, using different SHAP implementations and configurations, per deep learning predictive model (for each number of hidden layers). The rationale consists in randomly selecting and ignoring 25% of the 282 features feeding the predictive model (e.g., turning them to zero). If it causes the prediction to change beyond a specific threshold (in this study 0.10 and 0.25 were tested), then the explanation model should also reflect the magnitude of this change while ignoring the contributions of these same features. For example, the original predicted rubric score of an essay might be 5; however, when ignoring the information brought in by a subset of 70 randomly selected features (25% of 282), the prediction may turn to 4. On the other side, if the explanation model also predicts a 4 while ignoring the contributions of the same subset of features, then the explanation is considered as trustworthy. This allows to compute the precision, recall, and F1-score of each explanation model (number of true and false positives and true and false negatives). The process is repeated 500 times for every essay to determine the average precision and recall of every explanation model.

Judging Relevancy

So far, the consistency of explanations with predictions has been considered. However, consistent explanations do not imply relevant or meaningful explanations. Put another way, explanations only reflect what predictive models have learned during training. How can the black box of these explanations be opened? Looking directly at the numerical SHAP values of each explanation might seem a daunting task, but there exist tools, mainly visualizations (decision plot, summary plot, and dependence plot), that allow to make sense out of these explanations. However, before visualizing these explanations, another question needs to be addressed: which explanations or essays should be picked for further scrutiny of the AES system? Given the huge number of essays to examine and the tedious task to understand the underpinnings of a single explanation, a small subset of essays should be carefully picked that should represent concisely the state of correctness of the underlying predictive model. Again, this study applies and adapts the methodology in Ribeiro et al. (2016) . A greedy algorithm selects essays whose predictions are explained by as many features of global importance as possible to optimize feature coverage. Ribeiro et al. demonstrated in unrelated studies (i.e., sentiment analysis) that the correctness of a predictive model can be assessed with as few as four or five well-picked explanations.

For example, Table 3 reveals the global importance of five features. The square root of each feature’s global importance is also computed and considered instead to limit the influence of a small group of very influential features. The feature coverage of Instance 1 is 100% because all features are engaged in the explanation of the prediction. On the other hand, Instance 2 has a feature coverage of 61.5% because only Features 1, 4, and 5 are part of the prediction’s explanation. The feature coverage is calculated by summing the square root of each explanation’s feature’s global importance together and dividing by the sum of the square roots of all features’ global importance:

Additionally, it can be seen that Instance 4 does not have any zero-feature value although its feature coverage is only 84.6%. The algorithm was constrained to discard from the explanation any feature whose contribution (local importance) was too close to zero. In the case of Table 3 ’s example, any feature whose absolute SHAP value is less than 0.10 is ignored, hence leading to a feature coverage of:

In this paper’s study, the real threshold was 0.01. This constraint was actually a requirement for the DeepSHAP and GradientSHAP implementations because they only output non-zero SHAP values contrary to KernelSHAP which generates explanations with a fixed number of features: a non-zero SHAP value indicates that the feature is part of the explanation, while a zero value excludes the feature from the explanation. Without this parameter, all 282 features would be part of the explanation although a huge number only has a trivial (very close to zero) SHAP value. Now, a much smaller but variable subset of features makes up each explanation. This is one way in which Ribeiro et al.’s SP-LIME algorithm (SP stands for Submodular Pick) has been adapted to this study’s needs. In conclusion, notice how Instance 4 would be selected in preference to Instance 5 to explain Table 3 ’s underlying predictive model. Even though both instances have four features explaining their prediction, Instance 4’s features are more globally important than Instance 5’s features, and therefore Instance 4 has greater feature coverage than Instance 5.

Whereas Table 3 ’s example exhibits the feature coverage of one instance at a time, this study computes it for a subset of instances, where the absolute SHAP values are aggregated (summed) per candidate subset. When the sum of absolute SHAP values per feature exceeds the set threshold, the feature is then considered as covered by the selected set of instances. The objective in this study was to optimize the feature coverage while minimizing the number of essays to validate the AES model.

Research Questions

One of this article’s objectives is to assess the potential of deep learning in automated essay scoring. The literature has often claimed ( Hussein et al., 2019 ) that there are two approaches to AES, feature-based and deep learning, as though these two approaches were mutually exclusive. Yet, the literature also puts forward that feature-based AES models may be more interpretable than deep learning ones ( Amorim et al., 2018 ). This paper embraces the viewpoint that these two approaches can also be complementary by leveraging the state-of-the-art in NLP and automatic linguistic analysis and harnessing one of the richest pools of linguistic indices put forward in the research community ( Crossley et al., 2016 , 2017 , 2019 ; Kyle, 2016 ; Kyle et al., 2018 ) and applying a thorough feature selection process powered by deep learning. Moreover, the ability of deep learning of modeling complex non-linear relationships makes it particularly well-suited for AES given that the importance of a writing feature is highly dependent on its context, that is, its interactions with other writing features. Besides, this study leverages the SHAP interpretation method that is well-suited to interpret very complex models. Hence, this study elected to work with deep learning models and ensembles to test SHAP’s ability to explain these complex models. Previously, the literature has revealed the difficulty to have at the same time both accurate and interpretable models ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ), where favoring one comes at the expense of the other. However, this research shows how XAI makes it now possible to produce both accurate and interpretable models in the area of AES. Since ensembles have been repeatedly shown to boost the accuracy of predictive models, they were included as part of the tested deep learning architectures to maximize generalizability and accuracy, while making these predictive models interpretable and exploring whether deep learning can even enhance their descriptive accuracy further.

This study investigates the trustworthiness of explanation models, and more specifically, those explaining deep learning predictive models. For instance, does the depth, defined as the number of hidden layers, of an MLP neural network increases the trustworthiness of its SHAP explanation model? The answer to this question will help determine whether it is possible to have very accurate AES models while having competitively interpretable/explainable models, the corner stone for the generation of formative feedback. Remember that formative feedback is defined as “any kind of information provided to students about their actual state of learning or performance in order to modify the learner’s thinking or behavior in the direction of the learning standards” and that formative feedback “conveys where the student is, what are the goals to reach, and how to reach the goals” ( Goldin et al., 2017 ). This notion contrasts with summative feedback which basically is “a justification of the assessment results” ( Hao and Tsikerdekis, 2019 ).

As pointed out in the previous section, multiple SHAP implementations are evaluated in this study. Hence, this paper showcases whether the faster DeepSHAP and GradientSHAP implementations are as reliable as the slower KernelSHAP implementation . The answer to this research question will shed light on the feasibility of providing immediate formative feedback and this multiple times throughout students’ writing processes.

This study also looks at whether a summary of the data produces as trustworthy explanations as those from the original data . This question will be of interest to AES researchers and practitioners because it could allow to significantly decrease the processing time of the computing-intensive and model-agnostic KernelSHAP implementation and test further the potential of customizable explanations.

KernelSHAP allows to specify the total number of features that will shape the explanation of a prediction; for instance, this study experiments with explanations of 16 and 32 features and observes whether there exists a statistically significant difference in the reliability of these explanation models . Knowing this will hint at whether simpler or more complex explanations are more desirable when it comes to optimize their trustworthiness. If there is no statistically significant difference, then AES practitioners are given further flexibility in the selection of SHAP implementations to find the sweet spot between complexity of explanations and speed of processing. For instance, the KernelSHAP implementation allows to customize the number of factors making up an explanation, while the faster DeepSHAP and GradientSHAP do not.

Finally, this paper highlights the means to debug and compare the performance of predictive models through their explanations. Once a model is debugged, the process can be reused to fine-tune feature selection and/or feature engineering to improve predictive models and for the generation of formative feedback to both students and teachers.

The training, validation, and testing sets consist of 1567 essays, each of which has been scored by two human raters, who assigned a score between 0 and 3 per rubric (ideas, organization, style, and conventions). In particular, this article looks at predictive and descriptive accuracy of AES models on the third rubric, style. Note that although each essay has been scored by two human raters, the literature ( Shermis, 2014 ) is not explicit about whether only two or more human raters participated in the scoring of all 1567 essays; given the huge number of essays, it is likely that more than two human raters were involved in the scoring of these essays so that the amount of noise introduced by the various raters’ biases is unknown while probably being at some degree balanced among the two groups of raters. Figure 2 shows the confusion matrices of human raters on Style Rubric. The diagonal elements (dark gray) correspond to exact matches, whereas the light gray squares indicate adjacent matches. Figure 2A delineates the number of essays per pair of ratings, and Figure 2B shows the percentages per pair of ratings. The agreement level between each pair of human raters, measured by the quadratic weighted kappa, is 0.54; the percentage of exact matches is 65.3%; the percentage of adjacent matches is 34.4%; and 0.3% of essays are neither exact nor adjacent matches. Figures 2A,B specify the distributions of 0−3 ratings per group of human raters. Figure 2C exhibits the distribution of resolved scores (a resolved score is the sum of the two human ratings). The mean is 3.99 (with a standard deviation of 1.10), and the median and mode are 4. It is important to note that the levels of predictive accuracy reported in this article are measured on the scale of resolved scores (0−6) and that larger scales tend to slightly inflate quadratic weighted kappa values, which must be taken into account when comparing against the level of agreement between human raters. Comparison of percentages of exact and adjacent matches must also be made with this scoring scale discrepancy in mind.

www.frontiersin.org

Figure 2. Summary of the essay dataset (1567 Grade-7 narrative essays) investigated in this study. (A) Number of essays per pair of human ratings; the diagonal (dark gray squares) lists the numbers of exact matches while the light-gray squares list the numbers of adjacent matches; and the bottom row and the rightmost column highlight the distributions of ratings for both groups of human raters. (B) Percentages of essays per pair of human ratings; the diagonal (dark gray squares) lists the percentages of exact matches while the light-gray squares list the percentages of adjacent matches; and the bottom row and the rightmost column highlight the distributions (frequencies) of ratings for both groups of human raters. (C) The distribution of resolved rubric scores; a resolved score is the addition of its two constituent human ratings.

Predictive Accuracy and Descriptive Accuracy

Table 4 compiles the performance outcomes of the 10 predictive models evaluated in this study. The reader should remember that the performance of each model was averaged over five iterations and that two models were trained per number of hidden layers, one non-ensemble and one ensemble. Except for the 6-layer models, there is no clear winner among other models. Even for the 6-layer models, they are superior in terms of exact matches, the primary goal for a reliable AES system, but not according to adjacent matches. Nevertheless, on average ensemble models slightly outperform non-ensemble models. Hence, these ensemble models will be retained for the next analysis step. Moreover, given that five ensemble models were trained per neural network depth, the most accurate model among the five is selected and displayed in Table 4 .

www.frontiersin.org

Table 4. Performance of majority classifier and average/maximal performance of trained predictive models.

Next, for each selected ensemble predictive model, several explanation models are trained per predictive model. Every predictive model is explained by the “Deep,” “Grad,” and “Random” explainers, except for the 6-layer model where it was not possible to train a “Deep” explainer apparently due to a bug in the original SHAP code caused by either a unique condition in this study’s data or neural network architecture. However, this was beyond the scope of this study to fix and investigate this issue. As it will be demonstrated, no statistically significant difference exists between the accuracy of these explainers.

The “Random” explainer serves as a baseline model for comparison purpose. Remember that to evaluate the reliability of explanation models, the concurrent impact of randomly selecting and ignoring a subset of features on the prediction and explanation of rubric scores is analyzed. If the prediction changes significantly and its corresponding explanation changes (beyond a set threshold) accordingly (a true positive) or if the prediction remains within the threshold as does the explanation (a true negative), then the explanation is deemed as trustworthy. Hence, in the case of the Random explainer, it simulates random explanations by randomly selecting 32 non-zero features from the original set of 282 features. These random explanations consist only of non-zero features because, according to SHAP’s missingness property, a feature with a zero or a missing value never gets assigned any contribution to the prediction. If at least one of these 32 features is also an element of the subset of the ignored features, then the explanation is considered as untrustworthy, no matter the size of a feature’s contribution.

As for the layer-2 model, six different explanation models are evaluated. Recall that layer-2 models generated the least mean squared error (MSE) during hyperparameter optimization (see Table 1 ). Hence, this specific type of architecture was selected to test the reliability of these various explainers. The “Kernel” explainer is the most computing-intensive and took approximately 8 h of processing. It was trained using the full distributions of feature values in the training set and shaped explanations in terms of 32 features; the “Kernel-16” and “Kernel-32” models were trained on a summary (50 k -means centroids) of the training set to accelerate the processing by about one order of magnitude (less than 1 h). Besides, the “Kernel-16” explainer derived explanations in terms of 16 features, while the “Kernel-32” explainer explained predictions through 32 features. Table 5 exhibits the descriptive accuracy of these various explanation models according to a 0.10 and 0.25 threshold; in other words, by ignoring a subset of randomly picked features, it assesses whether or not the prediction and explanation change simultaneously. Note also how each explanation model, no matter the underlying predictive model, outperforms the “Random” model.

www.frontiersin.org

Table 5. Precision, recall, and F1 scores of the various explainers tested per type of predictive model.

The first research question addressed in this subsection asks whether there exists a statistically significant difference between the “Kernel” explainer, which generates 32-feature explanations and is trained on the whole training set, and the “Kernel-32” explainer which also generates 32-feature explanations and is trained on a summary of the training set. To determine this, an independent t-test was conducted using the precision, recall, and F1-score distributions (500 iterations) of both explainers. Table 6 reports the p -values of all the tests and for the 0.10 and 0.25 thresholds. It reveals that there is no statistically significant difference between the two explainers.

www.frontiersin.org

Table 6. p -values of independent t -tests comparing whether there exist statistically significant differences between the mean precisions, recalls, and F1-scores of 2-layer explainers and between those of the 2-layer’s, 4-layer’s, and 6-layer’s Gradient explainers.

The next research question tests whether there exists a difference in the trustworthiness of explainers shaping 16 or 32-feature explanations. Again t-tests were conducted to verify this. Table 6 lists the resulting p -values. Again, there is no statistically significant difference in the average precisions, recalls, and F1-scores of both explainers.

This leads to investigating whether the “Kernel,” “Deep,” and “Grad” explainers are equivalent. Table 6 exhibits the results of the t-tests conducted to verify this and reveals that none of the explainers produce a statistically significantly better performance than the other.

Armed with this evidence, it is now possible to verify whether deeper MLP neural networks produce more trustworthy explanation models. For this purpose, the performance of the “Grad” explainer for each type of predictive model will be compared against each other. The same methodology as previously applied is employed here. Table 6 , again, confirms that the explanation model of the 2-layer predictive model is statistically significantly less trustworthy than the 4-layer’s explanation model; the same can be said of the 4-layer and 6-layer models. The only exception is the difference in average precision between 2-layer and 4-layer models and between 4-layer and 6-layer models; however, there clearly exists a statistically significant difference in terms of precision (and also recall and F1-score) between 2-layer and 6-layer models.

The Best Subset of Essays to Judge AES Relevancy

Table 7 lists the four best essays optimizing feature coverage (93.9%) along with their resolved and predicted scores. Notice how two of the four essays were picked by the adapted SP-LIME algorithm with some strong disagreement between the human and the machine graders, two were picked with short and trivial text, and two were picked exhibiting perfect agreement between the human and machine graders. Interestingly, each pair of longer and shorter essays exposes both strong agreement and strong disagreement between the human and AI agents, offering an opportunity to debug the model and evaluate its ability to detect the presence or absence of more basic (e.g., very small number of words, occurrences of sentence fragments) and more advanced aspects (e.g., cohesion between adjacent sentences, variety of sentence structures) of narrative essay writing and to appropriately reward or penalize them.

www.frontiersin.org

Table 7. Set of best essays to evaluate the correctness of the 6-layer ensemble AES model.

Local Explanation: The Decision Plot

The decision plot lists writing features by order of importance from top to bottom. The line segments display the contribution (SHAP value) of each feature to the predicted rubric score. Note that an actual decision plot consists of all 282 features and that only the top portion of it (20 most important features) can be displayed (see Figure 3 ). A decision plot is read from bottom to top. The line starts at the base value and ends at the predicted rubric score. Given that the “Grad” explainer is the only explainer common to all predictive models, it has been selected to derive all explanations. The decision plots in Figure 3 show the explanations of the four essays in Table 7 ; the dashed line in these plots represents the explanation of the most accurate predictive model, that is the ensemble model with 6 hidden layers which also produced the most trustworthy explanation model. The predicted rubric score of each explanation model is listed in the bottom-right legend. Explanation of the writing features follow in a next subsection.

www.frontiersin.org

Figure 3. Comparisons of all models’ explanations of the most representative set of four essays: (A) Essay 228, (B) Essay 68, (C) Essay 219, and (D) Essay 124.

Global Explanation: The Summary Plot

It is advantageous to use SHAP to build explanation models because it provides a single framework to discover the writing features that are important to an individual essay (local) or a set of essays (global). While the decision plots list features of local importance, Figure 4 ’s summary plot ranks writing features by order of global importance (from top to bottom). All testing set’s 314 essays are represented as dots in the scatterplot of each writing feature. The position of a dot on the horizontal axis corresponds to the importance (SHAP value) of the writing feature for a specific essay and its color indicates the magnitude of the feature value in relation to the range of all 314 feature values. For example, large or small numbers of words within an essay generally contribute to increase or decrease rubric scores by up to 1.5 and 1.0, respectively. Decision plots can also be used to find the most important features for a small subset of essays; Figure 5 demonstrates the new ordering of writing indices when aggregating the feature contributions (summing the absolute values of SHAP values) of the four essays in Table 7 . Moreover, Figure 5 allows to compare the contributions of a feature to various essays. Note how the orderings in Figures 3 −5 can differ from each other, sharing many features of global importance as well as having their own unique features of local importance.

www.frontiersin.org

Figure 4. Summary plot listing the 32 most important features globally.

www.frontiersin.org

Figure 5. Decision plot delineating the best model’s explanations of Essays 228, 68, 219, and 124 (6-layer ensemble).

Definition of Important Writing Indices

The reader shall understand that it is beyond the scope of this paper to make a thorough description of all writing features. Nevertheless, the summary and decision plots in Figures 4 , 5 allow to identify a subset of features that should be examined in order to validate this study’s predictive model. Supplementary Table 1 combines and describes the 38 features in Figures 4 , 5 .

Dependence Plots

Although the summary plot in Figure 4 is insightful to determine whether small or large feature values are desirable, the dependence plots in Figure 6 prove essential to recommend whether a student should aim at increasing or decreasing the value of a specific writing feature. The dependence plots also reveal whether the student should directly act upon the targeted writing feature or indirectly on other features. The horizontal axis in each of the dependence plots in Figure 6 is the scale of the writing feature and the vertical axis is the scale of the writing feature’s contributions to the predicted rubric scores. Each dot in a dependence plot represents one of the testing set’s 314 essays, that is, the feature value and SHAP value belonging to the essay. The vertical dispersion of the dots on small intervals of the horizontal axis is indicative of interaction with other features ( Molnar, 2020 ). If the vertical dispersion is widespread (e.g., the [50, 100] horizontal-axis interval in the “word_count” dependence plot), then the contribution of the writing feature is most likely at some degree dependent on other writing feature(s).

www.frontiersin.org

Figure 6. Dependence plots: the horizontal axes represent feature values while vertical axes represent feature contributions (SHAP values). Each dot represents one of the 314 essays of the testing set and is colored according to the value of the feature with which it interacts most strongly. (A) word_count. (B) hdd42_aw. (C) ncomp_stdev. (D) dobj_per_cl. (E) grammar. (F) SENTENCE_FRAGMENT. (G) Sv_GI. (H) adjacent_overlap_verb_sent.

The contributions of this paper can be summarized as follows: (1) it proposes a means (SHAP) to explain individual predictions of AES systems and provides flexible guidelines to build powerful predictive models using more complex algorithms such as ensembles and deep learning neural networks; (2) it applies a methodology to quantitatively assess the trustworthiness of explanation models; (3) it tests whether faster SHAP implementations impact the descriptive accuracy of explanation models, giving insight on the applicability of SHAP in real pedagogical contexts such as AES; (4) it offers a toolkit to debug AES models, highlights linguistic intricacies, and underscores the means to offer formative feedback to novice writers; and more importantly, (5) it empowers learning analytics practitioners to make AI pedagogical agents accountable to the human educator, the ultimate problem holder responsible for the decisions and actions of AI ( Abbass, 2019 ). Basically, learning analytics (which encompasses tools such as AES) is characterized as an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that recurrently measures and proactively advances knowledge boundaries in human learning.

To exemplify this, imagine an AES system that supports instructors in the detection of plagiarism, gaming behaviors, and the marking of writing activities. As previously mentioned, essays are marked according to a grid of scoring rubrics: ideas, organization, style, and conventions. While an abundance of data (e.g., the 1592 writing metrics) can be collected by the AES tool, these data might still be insufficient to automate the scoring process of certain rubrics (e.g., ideas). Nevertheless, some scoring subtasks such as assessing a student’s vocabulary, sentence fluency, and conventions might still be assigned to AI since the data types available through existing automatic linguistic analysis tools prove sufficient to reliably alleviate the human marker’s workload. Interestingly, learning analytics is key for the accountability of AI agents to the human problem holder. As the volume of writing data (through a large student population, high-frequency capture of learning episodes, and variety of big learning data) accumulate in the system, new AI agents (predictive models) may apply for the job of “automarker.” These AI agents can be quite transparent through XAI ( Arrieta et al., 2020 ) explanation models, and a human instructor may assess the suitability of an agent for the job and hire the candidate agent that comes closest to human performance. Explanations derived from these models could serve as formative feedback to the students.

The AI marker can be assigned to assess the writing activities that are similar to those previously scored by the human marker(s) from whom it learns. Dissimilar and unseen essays can be automatically assigned to the human marker for reliable scoring, and the AI agent can learn from this manual scoring. To ensure accountability, students should be allowed to appeal the AI agent’s marking to the human marker. In addition, the human marker should be empowered to monitor and validate the scoring of select writing rubrics scored by the AI marker. If the human marker does not agree with the machine scores, the writing assignments may be flagged as incorrectly scored and re-assigned to a human marker. These flagged assignments may serve to update predictive models. Moreover, among the essays that are assigned to the machine marker, a small subset can be simultaneously assigned to the human marker for continuous quality control; that is, to continue comparing whether the agreement level between human and machine markers remains within an acceptable threshold. The human marker should be at any time able to “fire” an AI marker or “hire” an AI marker from a pool of potential machine markers.

This notion of a human-AI fusion has been observed in previous AES systems where the human marker’s workload has been found to be significantly alleviated, passing from scoring several hundreds of essays to just a few dozen ( Dronen et al., 2015 ; Hellman et al., 2019 ). As the AES technology matures and as the learning analytics tools continue to penetrate the education market, this alliance of semi-autonomous human and AI agents will lead to better evidence-based/informed pedagogy ( Nelson and Campbell, 2017 ). Such a human-AI alliance can also be guided to autonomously self-regulate its own hypothesis-authoring and data-acquisition processes for purposes of measuring and advancing knowledge boundaries in human learning.

Real-Time Formative Pedagogical Feedback

This paper provides the evidence that deep learning and SHAP can be used not only to score essays automatically but also to offer explanations in real-time. More specifically, the processing time to derive the 314 explanations of the testing set’s essays has been benchmarked for several types of explainers. It was found that the faster DeepSHAP and GradientSHAP implementations, which took only a few seconds of processing, did not produce less accurate explanations than the much slower KernelSHAP. KernelSHAP took approximately 8 h of processing to derive the explanation model of a 2-layer MLP neural network predictive model and 16 h for the 6-layer predictive model.

This finding also holds for various configurations of KernelSHAP, where the number of features (16 vs. 32) shaping the explanation (where all other features are assigned zero contributions) did not produce a statistically significant difference in the reliability of the explanation models. On average, the models had a precision between 63.9 and 64.1% and a recall between 41.0 and 42.9%. This means that after perturbation of the predictive and explanation models, on average 64% of the predictions the explanation model identified as changing were accurate. On the other side, only about 42% of all predictions that changed were detected by the various 2-layer explainers. An explanation was considered as untrustworthy if the sum of its feature contributions, when added to the average prediction (base value), was not within 0.1 from the perturbated prediction. Similarly, the average precision and recall of 2-layer explainers for the 0.25-threshold were about 69% and 62%, respectively.

Impact of Deep Learning on Descriptive Accuracy of Explanations

By analyzing the performance of the various predictive models in Table 4 , no clear conclusion can be reached as to which model should be deemed as the most desirable. Despite the fact that the 6-layer models slightly outperform the other models in terms of accuracy (percentage of exact matches between the resolved [human] and predicted [machine] scores), they are not the best when it comes to the percentages of adjacent (within 1 and 2) matches. Nevertheless, if the selection of the “best” model is based on the quadratic weighted kappas, the decision remains a nebulous one to make. Moreover, ensuring that machine learning actually learned something meaningful remains paramount, especially in contexts where the performance of a majority classifier is close to the human and machine performance. For example, a majority classifier model would get 46.3% of predictions accurate ( Table 4 ), while trained predictive models at best produce accurate predictions between 51.9 and 55.1%.

Since the interpretability of a machine learning model should be prioritized over accuracy ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ) for questions of transparency and trust, this paper investigated whether the impact of the depth of a MLP neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHAP explanation model. The data in Tables 1 , 5 , 6 effectively support the hypothesis that as the depth of the neural network increases, the precision and recall of the corresponding explanation model improve. Besides, this observation is particularly interesting because the 4-layer (Grad) explainer, which has hardly more parameters than the 2-layer model, is also more accurate than the 2-layer model, suggesting that the 6-layer explainer is most likely superior to other explainers not only because of its greater number of parameters, but also because of its number of hidden layers. By increasing the number of hidden layers, it can be seen that the precision and recall of an explanation model can pass on average from approximately 64 to 73% and from 42 to 52%, respectively, for the 0.10-threshold; and for the 0.25-threshold, from 69 to 79% and from 62 to 75%, respectively.

These results imply that the descriptive accuracy of an explanation model is an evidence of effective machine learning, which may exceed the level of agreement between the human and machine graders. Moreover, given that the superiority of a trained predictive model over a majority classifier is not always obvious, the consistency of its associated explanation model demonstrates this better. Note that theoretically the SHAP explanation model of the majority classifier should assign a zero contribution to each writing feature since the average prediction of such a model is actually the most frequent rubric score given by the human raters; hence, the base value is the explanation.

An interesting fact emerges from Figure 3 , that is, all explainers (2-layer to 6-layer) are more or less similar. It appears that they do not contradict each other. More specifically, they all agree on the direction of the contributions of the most important features. In other words, they unanimously determine that a feature should increase or decrease the predicted score. However, they differ from each other on the magnitude of the feature contributions.

To conclude, this study highlights the need to train predictive models that consider the descriptive accuracy of explanations. The idea is that explanation models consider predictions to derive explanations; explanations should be considered when training predictive models. This would not only help train interpretable models the very first time but also potentially break the status quo that may exist among similar explainers to possibly produce more powerful models. In addition, this research calls for a mechanism (e.g., causal diagrams) to allow teachers to guide the training process of predictive models. Put another way, as LA practitioners debug predictive models, their insights should be encoded in a language that will be understood by the machine and that will guide the training process to avoid learning the same errors and to accelerate the training time.

Accountable AES

Now that the superiority of the 6-layer predictive and explanation models has been demonstrated, some aspects of the relevancy of explanations should be examined more deeply, knowing that having an explanation model consistent with its underlying predictive model does not guarantee relevant explanations. Table 7 discloses the set of four essays that optimize the coverage of most globally important features to evaluate the correctness of the best AES model. It is quite intriguing to note that two of the four essays are among the 16 essays that have a major disagreement (off by 2) between the resolved and predicted rubric scores (1 vs. 3 and 4 vs. 2). The AES tool clearly overrated Essay 228, while it underrated Essay 219. Naturally, these two essays offer an opportunity to understand what is wrong with the model and ultimately debug the model to improve its accuracy and interpretability.

In particular, Essay 228 raises suspicion on the positive contributions of features such as “Ortho_N,” “lemma_mattr,” “all_logical,” “det_pobj_deps_struct,” and “dobj_per_cl.” Moreover, notice how the remaining 262 less important features (not visible in the decision plot in Figure 5 ) have already inflated the rubric score beyond the base value, more than any other essay. Given the very short length and very low quality of the essay, whose meaning is seriously undermined by spelling and grammatical errors, it is of utmost importance to verify how some of these features are computed. For example, is the average number of orthographic neighbors (Ortho_N) per token computed for unmeaningful tokens such as “R” and “whe”? Similarly, are these tokens considered as types in the type-token ratio over lemmas (lemma_mattr)? Given the absence of a meaningful grammatical structure conveying a complete idea through well-articulated words, it becomes obvious that the quality of NLP (natural language processing) parsing may become a source of (measurement) bias impacting both the way some writing features are computed and the predicted rubric score. To remedy this, two solutions are proposed: (1) enhancing the dataset with the part-of-speech sequence or the structure of dependency relationships along with associated confidence levels, or (2) augmenting the essay dataset with essays enclosing various types of non-sensical content to improve the learning of these feature contributions.

Note that all four essays have a text length smaller than the average: 171 words. Notice also how the “hdd42_aw” and “hdd42_fw” play a significant role to decrease the predicted score of Essays 228 and 68. The reader should note that these metrics require a minimum of 42 tokens in order to compute a non-zero D index, a measure of lexical diversity as explained in Supplementary Table 1 . Figure 6B also shows how zero “hdd42_aw” values are heavily penalized. This is extra evidence that supports the strong role that the number of words plays in determining these rubric scores, especially for very short essays where it is one of the few observations that can be reliably recorded.

Two other issues with the best trained AES model were identified. First, in the eyes of the model, the lowest the average number of direct objects per clause (dobj_per_cl), as seen in Figure 6D , the best it is. This appears to contradict one of the requirements of the “Style” rubric, which looks for a variety of sentence structures. Remember that direct objects imply the presence of transitive verbs (action verbs) and that the balanced usage of linking verbs and action verbs as well as of transitive and intransitive verbs is key to meet the requirement of variety of sentence structures. Moreover, note that the writing feature is about counting the number of direct objects per clause, not by sentence. Only one direct object is therefore possible per clause. On the other side, a sentence may contain several clauses, which determines if the sentence is a simple, compound, or a complex sentence. This also means that a sentence may have multiple direct objects and that a high ratio of direct objects per clause is indicative of sentence complexity. Too much complexity is also undesirable. Hence, it is fair to conclude that the higher range of feature values has reasonable feature contributions (SHAP values), while the lower range does not capture well the requirements of the rubric. The dependence plot should rather display a positive peak somewhere in the middle. Notice how the poor quality of Essay 228’s single sentence prevented the proper detection of the single direct object, “broke my finger,” and the so-called absence of direct objects was one of the reasons to wrongfully improve the predicted rubric score.

The model’s second issue discussed here is the presence of sentence fragments, a type of grammatical errors. Essentially, a sentence fragment is a clause that misses one of three critical components: a subject, a verb, or a complete idea. Figure 6E shows the contribution model of grammatical errors, all types combined, while Figure 6F shows specifically the contribution model of sentence fragments. It is interesting to see how SHAP further penalizes larger numbers of grammatical errors and that it takes into account the length of the essay (red dots represent essays with larger numbers of words; blue dots represent essays with smaller numbers of words). For example, except for essays with no identified grammatical errors, longer essays are less penalized than shorter ones. This is particularly obvious when there are 2−4 grammatical errors. The model increases the predicted rubric score only when there is no grammatical error. Moreover, the model tolerates longer essays with only one grammatical error, which sounds quite reasonable. On the other side, the model finds desirable high numbers of sentence fragments, a non-trivial type of grammatical errors. Even worse, the model decreases the rubric score of essays having no sentence fragment. Although grammatical issues are beyond the scope of the “Style” rubric, the model has probably included these features because of their impact on the quality of assessment of vocabulary usage and sentence fluency. The reader should observe how the very poor quality of an essay can even prevent the detection of such fundamental grammatical errors such as in the case of Essay 228, where the AES tool did not find any grammatical error or sentence fragment. Therefore, there should be a way for AES systems to detect a minimum level of text quality before attempting to score an essay. Note that the objective of this section was not to undertake thorough debugging of the model, but rather to underscore the effectiveness of SHAP in doing so.

Formative Feedback

Once an AES model is considered reasonably valid, SHAP can be a suitable formalism to empower the machine to provide formative feedback. For instance, the explanation of Essay 124, which has been assigned a rubric score of 3 by both human and machine markers, indicates that the top two factors contributing to decreasing the predicted rubric score are: (1) the essay length being smaller than average, and (2) the average number of verb lemma types occurring at least once in the next sentence (adjacent_overlap_verb_sent). Figures 6A,H give the overall picture in which the realism of the contributions of these two features can be analyzed. More specifically, Essay 124 is one of very few essays ( Figure 6H ) that makes redundant usage of the same verbs across adjacent sentences. Moreover, the essay displays poor sentence fluency where everything is only expressed in two sentences. To understand more accurately the impact of “adjacent_overlap_verb_sent” on the prediction, a few spelling errors have been corrected and the text has been divided in four sentences instead of two. Revision 1 in Table 8 exhibits the corrections made to the original essay. The decision plot’s dashed line in Figure 3D represents the original explanation of Essay 124, while Figure 7A demonstrates the new explanation of the revised essay. It can be seen that the “adjacent_overlap_verb_sent” feature is still the second most important feature in the new explanation of Essay 124, with a feature value of 0.429, still considered as very poor according to the dependence plot in Figure 6H .

www.frontiersin.org

Table 8. Revisions of Essay 124: improvement of sentence splitting, correction of some spelling errors, and elimination of redundant usage of same verbs (bold for emphasis in Essay 124’s original version; corrections in bold for Revisions 1 and 2).

www.frontiersin.org

Figure 7. Explanations of the various versions of Essay 124 and evaluation of feature effect for a range of feature values. (A) Explanation of Essay 124’s first revision. (B) Forecasting the effect of changing the ‘adjacent_overlap_verb_sent’ feature on the rubric score. (C) Explanation of Essay 124’s second revision. (D) Comparison of the explanations of all Essay 124’s versions.

To show how SHAP could be leveraged to offer remedial formative feedback, the revised version of Essay 124 will be explained again for eight different values of “adjacent_overlap_verb_sent” (0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0), while keeping the values of all other features constant. The set of these eight essays are explained by a newly trained SHAP explainer (Gradient), producing new SHAP values for each feature and each “revised” essay. Notice how the new model, called the feedback model, allows to foresee by how much a novice writer can hope to improve his/her score according to the “Style” rubric. If the student employs different verbs at every sentence, the feedback model estimates that the rubric score could be improved from 3.47 up to 3.65 ( Figure 7B ). Notice that the dashed line represents Revision 1, while other lines simulate one of the seven other altered essays. Moreover, it is important to note how changing the value of a single feature may influence the contributions that other features may have on the predicted score. Again, all explanations look similar in terms of direction, but certain features differ in terms of the magnitude of their contributions. However, the reader should observe how the targeted feature varies not only in terms of magnitude, but also of direction, allowing the student to ponder the relevancy of executing the recommended writing strategy.

Thus, upon receiving this feedback, assume that a student sets the goal to improve the effectiveness of his/her verb choice by eliminating any redundant verb, producing Revision 2 in Table 8 . The student submits his essay again to the AES system, which finally gives a new rubric score of 3.98, a significant improvement from the previous 3.47, allowing the student to get a 4 instead of a 3. Figure 7C exhibits the decision plot of Revision 2. To better observe how the various revisions of the student’s essay changed over time, their respective explanations have been plotted in the same decision plot ( Figure 7D ). Notice this time that the ordering of the features has changed to list the features of common importance to all of the essay’s versions. The feature ordering in Figures 7A−C complies with the same ordering as in Figure 3D , the decision plot of the original essay. These figures underscore the importance of tracking the interaction between the various features so that the model understands well the impact that changing one feature has on the others. TreeSHAP, an implementation for tree-based models, offers this capability and its potential on improving the quality of feedback provided to novice writers will be tested in a future version of this AES system.

This paper serves as a proof of concept of the applicability of XAI techniques in automated essay scoring, providing learning analytics practitioners and educators with a methodology on how to “hire” AI markers and make them accountable to their human counterparts. In addition to debug predictive models, SHAP explanation models can serve as some formalism of a broader learning analytics platform, where aspects of prescriptive analytics (provision of remedial formative feedback) can be added on top of the more pervasive predictive analytics.

However, the main weakness of the approach put forward in this paper consists in omitting many types of spatio-temporal data. In other words, it ignores precious information inherent to the writing process, which may prove essential to guess the intent of the student, especially in contexts of poor sentence structures and high grammatical inaccuracy. Hence, this paper calls for adapting current NLP technologies to educational purposes, where the quality of writing may be suboptimal, which is contrary to many utopian scenarios where NLP is used for content analysis, opinion mining, topic modeling, or fact extraction trained on corpora of high-quality texts. By capturing the writing process preceding a submission of an essay to an AES tool, other kinds of explanation models can also be trained to offer feedback not only from a linguistic perspective but also from a behavioral one (e.g., composing vs. revising); that is, the AES system could inform novice writers about suboptimal and optimal writing strategies (e.g., planning a revision phase after bursts of writing).

In addition, associating sections of text with suboptimal writing features, those whose contributions lower the predicted score, would be much more informative. This spatial information would not only allow to point out what is wrong and but also where it is wrong, answering more efficiently the question why an essay is wrong. This problem could be simply approached through a multiple-inputs and mixed-data feature-based (MLP) neural network architecture fed by both linguistic indices and textual data ( n -grams), where the SHAP explanation model would assign feature contributions to both types of features and any potential interaction between them. A more complex approach could address the problem through special types of recurrent neural networks such as Ordered-Neurons LSTMs (long short-term memory), which are well adapted to the parsing of natural language, and where the natural sequence of text is not only captured but also its hierarchy of constituents ( Shen et al., 2018 ). After all, this paper highlights the fact that the potential of deep learning can reach beyond the training of powerful predictive models and be better visible in the higher trustworthiness of explanation models. This paper also calls for optimizing the training of predictive models by considering the descriptive accuracy of explanations and the human expert’s qualitative knowledge (e.g., indicating the direction of feature contributions) during the training process.

Data Availability Statement

The datasets and code of this study can be found in these Open Science Framework’s online repositories: https://osf.io/fxvru/ .

Author Contributions

VK architected the concept of an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that measures and advances knowledge boundaries in human learning, which essentially defines the key traits of learning analytics. DB was responsible for its implementation in the area of explainable automated essay scoring and for the training and validation of the predictive and explanation models. Together they offer an XAI-based proof of concept of a prescriptive model that can offer real-time formative remedial feedback to novice writers. Both authors contributed to the article and approved its publication.

Research reported in this article was supported by the Academic Research Fund (ARF) publication grant of Athabasca University under award number (24087).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2020.572367/full#supplementary-material

  • ^ https://www.kaggle.com/c/asap-aes
  • ^ https://www.linguisticanalysistools.org/

Abbass, H. A. (2019). Social integration of artificial intelligence: functions, automation allocation logic and human-autonomy trust. Cogn. Comput. 11, 159–171. doi: 10.1007/s12559-018-9619-0

CrossRef Full Text | Google Scholar

Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052

Amorim, E., Cançado, M., and Veloso, A. (2018). “Automated essay scoring in the presence of biased ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , New Orleans, LA, 229–237.

Google Scholar

Arrieta, A. B., Díaz-Rodríguez, N., Ser, J., Del Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/j.inffus.2019.12.012

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., et al. (2007). The English lexicon project. Behav. Res. Methods 39, 445–459. doi: 10.3758/BF03193014

PubMed Abstract | CrossRef Full Text | Google Scholar

Boulanger, D., and Kumar, V. (2018). “Deep learning in automated essay scoring,” in Proceedings of the International Conference of Intelligent Tutoring Systems , eds R. Nkambou, R. Azevedo, and J. Vassileva (Cham: Springer International Publishing), 294–299. doi: 10.1007/978-3-319-91464-0_30

Boulanger, D., and Kumar, V. (2019). “Shedding light on the automated essay scoring process,” in Proceedings of the International Conference on Educational Data Mining , 512–515.

Boulanger, D., and Kumar, V. (2020). “SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization,” in Intelligent Tutoring Systems , eds V. Kumar and C. Troussas (Cham: Springer International Publishing), 68–78. doi: 10.1007/978-3-030-49663-0_10

Chen, H., Lundberg, S., and Lee, S.-I. (2019). Explaining models by propagating Shapley values of local components. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1911.11888 (accessed September 22, 2020).

Crossley, S. A., Bradfield, F., and Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writ. Res. 11, 251–270. doi: 10.17239/jowr-2019.11.02.01

Crossley, S. A., Kyle, K., and McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): automatic assessment of local, global, and text cohesion. Behav. Res. Methods 48, 1227–1237. doi: 10.3758/s13428-015-0651-7

Crossley, S. A., Kyle, K., and McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social-order analysis. Behav. Res. Methods 49, 803–821. doi: 10.3758/s13428-016-0743-z

Dronen, N., Foltz, P. W., and Habermehl, K. (2015). “Effective sampling for large-scale automated writing evaluation systems,” in Proceedings of the Second (2015) ACM Conference on Learning @ Scale , 3–10.

Goldin, I., Narciss, S., Foltz, P., and Bauer, M. (2017). New directions in formative feedback in interactive learning environments. Int. J. Artif. Intellig. Educ. 27, 385–392. doi: 10.1007/s40593-016-0135-7

Hao, Q., and Tsikerdekis, M. (2019). “How automated feedback is delivered matters: formative feedback and knowledge transfer,” in Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE) , Covington, KY, 1–6.

Hellman, S., Rosenstein, M., Gorman, A., Murray, W., Becker, L., Baikadi, A., et al. (2019). “Scaling up writing in the curriculum: batch mode active learning for automated essay scoring,” in Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale , (New York, NY: Association for Computing Machinery).

Hussein, M. A., Hassan, H., and Nassef, M. (2019). Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5:e208. doi: 10.7717/peerj-cs.208

Kumar, V., and Boulanger, D. (2020). Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intellig. Educ. doi: 10.1007/s40593-020-00211-5

Kumar, V., Fraser, S. N., and Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. J. Writ. Anal. 1, 176–226.

Kyle, K. (2016). Measuring Syntactic Development In L2 Writing: Fine Grained Indices Of Syntactic Complexity And Usage-Based Indices Of Syntactic Sophistication. Dissertation, Georgia State University, Atlanta, GA.

Kyle, K., Crossley, S., and Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): version 2.0. Behav. Res. Methods 50, 1030–1046. doi: 10.3758/s13428-017-0924-4

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1802.03888 (accessed September 22, 2020).

Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems , eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Red Hook, NY: Curran Associates, Inc), 4765–4774.

Madnani, N., and Cahill, A. (2018). “Automated scoring: beyond natural language processing,” in Proceedings of the 27th International Conference on Computational Linguistics , (Santa Fe: Association for Computational Linguistics), 1099–1109.

Madnani, N., Loukina, A., von Davier, A., Burstein, J., and Cahill, A. (2017). “Building better open-source tools to support fairness in automated scoring,” in Proceedings of the First (ACL) Workshop on Ethics in Natural Language Processing , (Valencia: Association for Computational Linguistics), 41–52.

McCarthy, P. M., and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392. doi: 10.3758/brm.42.2.381

Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., et al. (2019). “Analytic score prediction and justification identification in automated short answer scoring,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , Florence, 316–325.

Molnar, C. (2020). Interpretable Machine Learning . Abu Dhabi: Lulu

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U.S.A. 116, 22071–22080. doi: 10.1073/pnas.1900654116

Nelson, J., and Campbell, C. (2017). Evidence-informed practice in education: meanings and applications. Educ. Res. 59, 127–135. doi: 10.1080/00131881.2017.1314115

Rahimi, Z., Litman, D., Correnti, R., Wang, E., and Matsumura, L. C. (2017). Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intellig. Educ. 27, 694–728. doi: 10.1007/s40593-017-0143-2

Reinertsen, N. (2018). Why can’t it mark this one? A qualitative analysis of student writing rejected by an automated essay scoring system. English Austral. 53:52.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should i trust you?”: explaining the predictions of any classifier. CoRR, abs/1602.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1602.04938 (accessed September 22, 2020).

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl. Meas. Educ. 31, 191–214. doi: 10.1080/08957347.2018.1464448

Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., and Köller, O. (2019). Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019, 1–23. doi: 10.1002/ets2.12249

Shen, Y., Tan, S., Sordoni, A., and Courville, A. C. (2018). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. CoRR, abs/1810.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1810.09536 (accessed September 22, 2020).

Shermis, M. D. (2014). State-of-the-art automated essay scoring: competition, results, and future directions from a United States demonstration. Assess. Writ. 20, 53–76. doi: 10.1016/j.asw.2013.04.001

Taghipour, K. (2017). Robust Trait-Specific Essay Scoring using Neural Networks and Density Estimators. Dissertation, National University of Singapore, Singapore.

West-Smith, P., Butler, S., and Mayfield, E. (2018). “Trustworthy automated essay scoring without explicit construct validity,” in Proceedings of the 2018 AAAI Spring Symposium Series , (New York, NY: ACM).

Woods, B., Adamson, D., Miel, S., and Mayfield, E. (2017). “Formative essay feedback using predictive scoring models,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , (New York, NY: ACM), 2071–2080.

Keywords : explainable artificial intelligence, SHAP, automated essay scoring, deep learning, trust, learning analytics, feedback, rubric

Citation: Kumar V and Boulanger D (2020) Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. Front. Educ. 5:572367. doi: 10.3389/feduc.2020.572367

Received: 14 June 2020; Accepted: 09 September 2020; Published: 06 October 2020.

Reviewed by:

Copyright © 2020 Kumar and Boulanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Boulanger, [email protected]

This article is part of the Research Topic

Learning Analytics for Supporting Individualization: Data-informed Adaptation of Learning

Datasets for Automatic Scoring of Educational Text Data

This page provides an overview of freely available automatic free-text scoring datasets from the areas of automatic essay grading and content/short-answer scoring.

Automated Essay Grading

Two Bears Playing Chess

  • Give Me More Feedback I: Annotating Argument Persuasiveness and Related Attributes in Student Essays . Winston Carlile, Nishant Gurrapadi, Zixuan Ke and Vincent Ng Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. - Data used in this paper is here .
  • Give Me More Feedback II: Annotating Thesis Strength and Related Attributes in Student Essays . Zixuan Ke, Hrishikesh Inamdar, Hui Lin and Vincent Ng. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2019. - Data used in this paper is here .

Advertisement

Advertisement

An automated essay scoring systems: a systematic literature review

  • Published: 23 September 2021
  • Volume 55 , pages 2495–2527, ( 2022 )

Cite this article

  • Dadi Ramesh   ORCID: orcid.org/0000-0002-3967-8914 1 , 2 &
  • Suresh Kumar Sanampudi 3  

31k Accesses

64 Citations

6 Altmetric

Explore all metrics

Assessment in the Education system plays a significant role in judging student performance. The present evaluation system is through human assessment. As the number of teachers' student ratio is gradually increasing, the manual evaluation process becomes complicated. The drawback of manual evaluation is that it is time-consuming, lacks reliability, and many more. This connection online examination system evolved as an alternative tool for pen and paper-based methods. Present Computer-based evaluation system works only for multiple-choice questions, but there is no proper evaluation system for grading essays and short answers. Many researchers are working on automated essay grading and short answer scoring for the last few decades, but assessing an essay by considering all parameters like the relevance of the content to the prompt, development of ideas, Cohesion, and Coherence is a big challenge till now. Few researchers focused on Content-based evaluation, while many of them addressed style-based assessment. This paper provides a systematic literature review on automated essay scoring systems. We studied the Artificial Intelligence and Machine Learning techniques used to evaluate automatic essay scoring and analyzed the limitations of the current studies and research trends. We observed that the essay evaluation is not done based on the relevance of the content and coherence.

Similar content being viewed by others

student essay dataset

Automated Essay Scoring Systems

student essay dataset

Automated Essay Scoring System Based on Rubric

Avoid common mistakes on your manuscript.

1 Introduction

Due to COVID 19 outbreak, an online educational system has become inevitable. In the present scenario, almost all the educational institutions ranging from schools to colleges adapt the online education system. The assessment plays a significant role in measuring the learning ability of the student. Most automated evaluation is available for multiple-choice questions, but assessing short and essay answers remain a challenge. The education system is changing its shift to online-mode, like conducting computer-based exams and automatic evaluation. It is a crucial application related to the education domain, which uses natural language processing (NLP) and Machine Learning techniques. The evaluation of essays is impossible with simple programming languages and simple techniques like pattern matching and language processing. Here the problem is for a single question, we will get more responses from students with a different explanation. So, we need to evaluate all the answers concerning the question.

Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades the student responses by considering appropriate features. The AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. ( 1973 ). PEG evaluates the writing characteristics such as grammar, diction, construction, etc., to grade the essay. A modified version of the PEG by Shermis et al. ( 2001 ) was released, which focuses on grammar checking with a correlation between human evaluators and the system. Foltz et al. ( 1999 ) introduced an Intelligent Essay Assessor (IEA) by evaluating content using latent semantic analysis to produce an overall score. Powers et al. ( 2002 ) proposed E-rater and Intellimetric by Rudner et al. ( 2006 ) and Bayesian Essay Test Scoring System (BESTY) by Rudner and Liang ( 2002 ), these systems use natural language processing (NLP) techniques that focus on style and content to obtain the score of an essay. The vast majority of the essay scoring systems in the 1990s followed traditional approaches like pattern matching and a statistical-based approach. Since the last decade, the essay grading systems started using regression-based and natural language processing techniques. AES systems like Dong et al. ( 2017 ) and others developed from 2014 used deep learning techniques, inducing syntactic and semantic features resulting in better results than earlier systems.

Ohio, Utah, and most US states are using AES systems in school education, like Utah compose tool, Ohio standardized test (an updated version of PEG), evaluating millions of student's responses every year. These systems work for both formative, summative assessments and give feedback to students on the essay. Utah provided basic essay evaluation rubrics (six characteristics of essay writing): Development of ideas, organization, style, word choice, sentence fluency, conventions. Educational Testing Service (ETS) has been conducting significant research on AES for more than a decade and designed an algorithm to evaluate essays on different domains and providing an opportunity for test-takers to improve their writing skills. In addition, they are current research content-based evaluation.

The evaluation of essay and short answer scoring should consider the relevance of the content to the prompt, development of ideas, Cohesion, Coherence, and domain knowledge. Proper assessment of the parameters mentioned above defines the accuracy of the evaluation system. But all these parameters cannot play an equal role in essay scoring and short answer scoring. In a short answer evaluation, domain knowledge is required, like the meaning of "cell" in physics and biology is different. And while evaluating essays, the implementation of ideas with respect to prompt is required. The system should also assess the completeness of the responses and provide feedback.

Several studies examined AES systems, from the initial to the latest AES systems. In which the following studies on AES systems are Blood ( 2011 ) provided a literature review from PEG 1984–2010. Which has covered only generalized parts of AES systems like ethical aspects, the performance of the systems. Still, they have not covered the implementation part, and it’s not a comparative study and has not discussed the actual challenges of AES systems.

Burrows et al. ( 2015 ) Reviewed AES systems on six dimensions like dataset, NLP techniques, model building, grading models, evaluation, and effectiveness of the model. They have not covered feature extraction techniques and challenges in features extractions. Covered only Machine Learning models but not in detail. This system not covered the comparative analysis of AES systems like feature extraction, model building, and level of relevance, cohesion, and coherence not covered in this review.

Ke et al. ( 2019 ) provided a state of the art of AES system but covered very few papers and not listed all challenges, and no comparative study of the AES model. On the other hand, Hussein et al. in ( 2019 ) studied two categories of AES systems, four papers from handcrafted features for AES systems, and four papers from the neural networks approach, discussed few challenges, and did not cover feature extraction techniques, the performance of AES models in detail.

Klebanov et al. ( 2020 ). Reviewed 50 years of AES systems, listed and categorized all essential features that need to be extracted from essays. But not provided a comparative analysis of all work and not discussed the challenges.

This paper aims to provide a systematic literature review (SLR) on automated essay grading systems. An SLR is an Evidence-based systematic review to summarize the existing research. It critically evaluates and integrates all relevant studies' findings and addresses the research domain's specific research questions. Our research methodology uses guidelines given by Kitchenham et al. ( 2009 ) for conducting the review process; provide a well-defined approach to identify gaps in current research and to suggest further investigation.

We addressed our research method, research questions, and the selection process in Sect.  2 , and the results of the research questions have discussed in Sect.  3 . And the synthesis of all the research questions addressed in Sect.  4 . Conclusion and possible future work discussed in Sect.  5 .

2 Research method

We framed the research questions with PICOC criteria.

Population (P) Student essays and answers evaluation systems.

Intervention (I) evaluation techniques, data sets, features extraction methods.

Comparison (C) Comparison of various approaches and results.

Outcomes (O) Estimate the accuracy of AES systems,

Context (C) NA.

2.1 Research questions

To collect and provide research evidence from the available studies in the domain of automated essay grading, we framed the following research questions (RQ):

RQ1 what are the datasets available for research on automated essay grading?

The answer to the question can provide a list of the available datasets, their domain, and access to the datasets. It also provides a number of essays and corresponding prompts.

RQ2 what are the features extracted for the assessment of essays?

The answer to the question can provide an insight into various features so far extracted, and the libraries used to extract those features.

RQ3, which are the evaluation metrics available for measuring the accuracy of algorithms?

The answer will provide different evaluation metrics for accurate measurement of each Machine Learning approach and commonly used measurement technique.

RQ4 What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?

It can provide insights into various Machine Learning techniques like regression models, classification models, and neural networks for implementing essay grading systems. The response to the question can give us different assessment approaches for automated essay grading systems.

RQ5 What are the challenges/limitations in the current research?

The answer to the question provides limitations of existing research approaches like cohesion, coherence, completeness, and feedback.

2.2 Search process

We conducted an automated search on well-known computer science repositories like ACL, ACM, IEEE Explore, Springer, and Science Direct for an SLR. We referred to papers published from 2010 to 2020 as much of the work during these years focused on advanced technologies like deep learning and natural language processing for automated essay grading systems. Also, the availability of free data sets like Kaggle (2012), Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. ( 2011 ) led to research this domain.

Search Strings : We used search strings like “Automated essay grading” OR “Automated essay scoring” OR “short answer scoring systems” OR “essay scoring systems” OR “automatic essay evaluation” and searched on metadata.

2.3 Selection criteria

After collecting all relevant documents from the repositories, we prepared selection criteria for inclusion and exclusion of documents. With the inclusion and exclusion criteria, it becomes more feasible for the research to be accurate and specific.

Inclusion criteria 1 Our approach is to work with datasets comprise of essays written in English. We excluded the essays written in other languages.

Inclusion criteria 2  We included the papers implemented on the AI approach and excluded the traditional methods for the review.

Inclusion criteria 3 The study is on essay scoring systems, so we exclusively included the research carried out on only text data sets rather than other datasets like image or speech.

Exclusion criteria  We removed the papers in the form of review papers, survey papers, and state of the art papers.

2.4 Quality assessment

In addition to the inclusion and exclusion criteria, we assessed each paper by quality assessment questions to ensure the article's quality. We included the documents that have clearly explained the approach they used, the result analysis and validation.

The quality checklist questions are framed based on the guidelines from Kitchenham et al. ( 2009 ). Each quality assessment question was graded as either 1 or 0. The final score of the study range from 0 to 3. A cut off score for excluding a study from the review is 2 points. Since the papers scored 2 or 3 points are included in the final evaluation. We framed the following quality assessment questions for the final study.

Quality Assessment 1: Internal validity.

Quality Assessment 2: External validity.

Quality Assessment 3: Bias.

The two reviewers review each paper to select the final list of documents. We used the Quadratic Weighted Kappa score to measure the final agreement between the two reviewers. The average resulted from the kappa score is 0.6942, a substantial agreement between the reviewers. The result of evolution criteria shown in Table 1 . After Quality Assessment, the final list of papers for review is shown in Table 2 . The complete selection process is shown in Fig. 1 . The total number of selected papers in year wise as shown in Fig. 2 .

figure 1

Selection process

figure 2

Year wise publications

3.1 What are the datasets available for research on automated essay grading?

To work with problem statement especially in Machine Learning and deep learning domain, we require considerable amount of data to train the models. To answer this question, we listed all the data sets used for training and testing for automated essay grading systems. The Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) Yannakoudakis et al. ( 2011 ) developed corpora that contain 1244 essays and ten prompts. This corpus evaluates whether a student can write the relevant English sentences without any grammatical and spelling mistakes. This type of corpus helps to test the models built for GRE and TOFEL type of exams. It gives scores between 1 and 40.

Bailey and Meurers ( 2008 ), Created a dataset (CREE reading comprehension) for language learners and automated short answer scoring systems. The corpus consists of 566 responses from intermediate students. Mohler and Mihalcea ( 2009 ). Created a dataset for the computer science domain consists of 630 responses for data structure assignment questions. The scores are range from 0 to 5 given by two human raters.

Dzikovska et al. ( 2012 ) created a Student Response Analysis (SRA) corpus. It consists of two sub-groups: the BEETLE corpus consists of 56 questions and approximately 3000 responses from students in the electrical and electronics domain. The second one is the SCIENTSBANK(SemEval-2013) (Dzikovska et al. 2013a ; b ) corpus consists of 10,000 responses on 197 prompts on various science domains. The student responses ladled with "correct, partially correct incomplete, Contradictory, Irrelevant, Non-domain."

In the Kaggle (2012) competition, released total 3 types of corpuses on an Automated Student Assessment Prize (ASAP1) (“ https://www.kaggle.com/c/asap-sas/ ” ) essays and short answers. It has nearly 17,450 essays, out of which it provides up to 3000 essays for each prompt. It has eight prompts that test 7th to 10th grade US students. It gives scores between the [0–3] and [0–60] range. The limitations of these corpora are: (1) it has a different score range for other prompts. (2) It uses statistical features such as named entities extraction and lexical features of words to evaluate essays. ASAP +  + is one more dataset from Kaggle. It is with six prompts, and each prompt has more than 1000 responses total of 10,696 from 8th-grade students. Another corpus contains ten prompts from science, English domains and a total of 17,207 responses. Two human graders evaluated all these responses.

Correnti et al. ( 2013 ) created a Response-to-Text Assessment (RTA) dataset used to check student writing skills in all directions like style, mechanism, and organization. 4–8 grade students give the responses to RTA. Basu et al. ( 2013 ) created a power grading dataset with 700 responses for ten different prompts from US immigration exams. It contains all short answers for assessment.

The TOEFL11 corpus Blanchard et al. ( 2013 ) contains 1100 essays evenly distributed over eight prompts. It is used to test the English language skills of a candidate attending the TOFEL exam. It scores the language proficiency of a candidate as low, medium, and high.

International Corpus of Learner English (ICLE) Granger et al. ( 2009 ) built a corpus of 3663 essays covering different dimensions. It has 12 prompts with 1003 essays that test the organizational skill of essay writing, and13 prompts, each with 830 essays that examine the thesis clarity and prompt adherence.

Argument Annotated Essays (AAE) Stab and Gurevych ( 2014 ) developed a corpus that contains 102 essays with 101 prompts taken from the essayforum2 site. It tests the persuasive nature of the student essay. The SCIENTSBANK corpus used by Sakaguchi et al. ( 2015 ) available in git-hub, containing 9804 answers to 197 questions in 15 science domains. Table 3 illustrates all datasets related to AES systems.

3.2 RQ2 what are the features extracted for the assessment of essays?

Features play a major role in the neural network and other supervised Machine Learning approaches. The automatic essay grading systems scores student essays based on different types of features, which play a prominent role in training the models. Based on their syntax and semantics and they are categorized into three groups. 1. statistical-based features Contreras et al. ( 2018 ); Kumar et al. ( 2019 ); Mathias and Bhattacharyya ( 2018a ; b ) 2. Style-based (Syntax) features Cummins et al. ( 2016 ); Darwish and Mohamed ( 2020 ); Ke et al. ( 2019 ). 3. Content-based features Dong et al. ( 2017 ). A good set of features appropriate models evolved better AES systems. The vast majority of the researchers are using regression models if features are statistical-based. For Neural Networks models, researches are using both style-based and content-based features. The following table shows the list of various features used in existing AES Systems. Table 4 represents all set of features used for essay grading.

We studied all the feature extracting NLP libraries as shown in Fig. 3 . that are used in the papers. The NLTK is an NLP tool used to retrieve statistical features like POS, word count, sentence count, etc. With NLTK, we can miss the essay's semantic features. To find semantic features Word2Vec Mikolov et al. ( 2013 ), GloVe Jeffrey Pennington et al. ( 2014 ) is the most used libraries to retrieve the semantic text from the essays. And in some systems, they directly trained the model with word embeddings to find the score. From Fig. 4 as observed that non-content-based feature extraction is higher than content-based.

figure 3

Usages of tools

figure 4

Number of papers on content based features

3.3 RQ3 which are the evaluation metrics available for measuring the accuracy of algorithms?

The majority of the AES systems are using three evaluation metrics. They are (1) quadrated weighted kappa (QWK) (2) Mean Absolute Error (MAE) (3) Pearson Correlation Coefficient (PCC) Shehab et al. ( 2016 ). The quadratic weighted kappa will find agreement between human evaluation score and system evaluation score and produces value ranging from 0 to 1. And the Mean Absolute Error is the actual difference between human-rated score to system-generated score. The mean square error (MSE) measures the average squares of the errors, i.e., the average squared difference between the human-rated and the system-generated scores. MSE will always give positive numbers only. Pearson's Correlation Coefficient (PCC) finds the correlation coefficient between two variables. It will provide three values (0, 1, − 1). "0" represents human-rated and system scores that are not related. "1" represents an increase in the two scores. "− 1" illustrates a negative relationship between the two scores.

3.4 RQ4 what are the Machine Learning techniques being used for automatic essay grading, and how are they implemented?

After scrutinizing all documents, we categorize the techniques used in automated essay grading systems into four baskets. 1. Regression techniques. 2. Classification model. 3. Neural networks. 4. Ontology-based approach.

All the existing AES systems developed in the last ten years employ supervised learning techniques. Researchers using supervised methods viewed the AES system as either regression or classification task. The goal of the regression task is to predict the score of an essay. The classification task is to classify the essays belonging to (low, medium, or highly) relevant to the question's topic. Since the last three years, most AES systems developed made use of the concept of the neural network.

3.4.1 Regression based models

Mohler and Mihalcea ( 2009 ). proposed text-to-text semantic similarity to assign a score to the student essays. There are two text similarity measures like Knowledge-based measures, corpus-based measures. There eight knowledge-based tests with all eight models. They found the similarity. The shortest path similarity determines based on the length, which shortest path between two contexts. Leacock & Chodorow find the similarity based on the shortest path's length between two concepts using node-counting. The Lesk similarity finds the overlap between the corresponding definitions, and Wu & Palmer algorithm finds similarities based on the depth of two given concepts in the wordnet taxonomy. Resnik, Lin, Jiang&Conrath, Hirst& St-Onge find the similarity based on different parameters like the concept, probability, normalization factor, lexical chains. In corpus-based likeness, there LSA BNC, LSA Wikipedia, and ESA Wikipedia, latent semantic analysis is trained on Wikipedia and has excellent domain knowledge. Among all similarity scores, correlation scores LSA Wikipedia scoring accuracy is more. But these similarity measure algorithms are not using NLP concepts. These models are before 2010 and basic concept models to continue the research automated essay grading with updated algorithms on neural networks with content-based features.

Adamson et al. ( 2014 ) proposed an automatic essay grading system which is a statistical-based approach in this they retrieved features like POS, Character count, Word count, Sentence count, Miss spelled words, n-gram representation of words to prepare essay vector. They formed a matrix with these all vectors in that they applied LSA to give a score to each essay. It is a statistical approach that doesn’t consider the semantics of the essay. The accuracy they got when compared to the human rater score with the system is 0.532.

Cummins et al. ( 2016 ). Proposed Timed Aggregate Perceptron vector model to give ranking to all the essays, and later they converted the rank algorithm to predict the score of the essay. The model trained with features like Word unigrams, bigrams, POS, Essay length, grammatical relation, Max word length, sentence length. It is multi-task learning, gives ranking to the essays, and predicts the score for the essay. The performance evaluated through QWK is 0.69, a substantial agreement between the human rater and the system.

Sultan et al. ( 2016 ). Proposed a Ridge regression model to find short answer scoring with Question Demoting. Question Demoting is the new concept included in the essay's final assessment to eliminate duplicate words from the essay. The extracted features are Text Similarity, which is the similarity between the student response and reference answer. Question Demoting is the number of repeats in a student response. With inverse document frequency, they assigned term weight. The sentence length Ratio is the number of words in the student response, is another feature. With these features, the Ridge regression model was used, and the accuracy they got 0.887.

Contreras et al. ( 2018 ). Proposed Ontology based on text mining in this model has given a score for essays in phases. In phase-I, they generated ontologies with ontoGen and SVM to find the concept and similarity in the essay. In phase II from ontologies, they retrieved features like essay length, word counts, correctness, vocabulary, and types of word used, domain information. After retrieving statistical data, they used a linear regression model to find the score of the essay. The accuracy score is the average of 0.5.

Darwish and Mohamed ( 2020 ) proposed the fusion of fuzzy Ontology with LSA. They retrieve two types of features, like syntax features and semantic features. In syntax features, they found Lexical Analysis with tokens, and they construct a parse tree. If the parse tree is broken, the essay is inconsistent—a separate grade assigned to the essay concerning syntax features. The semantic features are like similarity analysis, Spatial Data Analysis. Similarity analysis is to find duplicate sentences—Spatial Data Analysis for finding Euclid distance between the center and part. Later they combine syntax features and morphological features score for the final score. The accuracy they achieved with the multiple linear regression model is 0.77, mostly on statistical features.

Süzen Neslihan et al. ( 2020 ) proposed a text mining approach for short answer grading. First, their comparing model answers with student response by calculating the distance between two sentences. By comparing the model answer with student response, they find the essay's completeness and provide feedback. In this approach, model vocabulary plays a vital role in grading, and with this model vocabulary, the grade will be assigned to the student's response and provides feedback. The correlation between the student answer to model answer is 0.81.

3.4.2 Classification based Models

Persing and Ng ( 2013 ) used a support vector machine to score the essay. The features extracted are OS, N-gram, and semantic text to train the model and identified the keywords from the essay to give the final score.

Sakaguchi et al. ( 2015 ) proposed two methods: response-based and reference-based. In response-based scoring, the extracted features are response length, n-gram model, and syntactic elements to train the support vector regression model. In reference-based scoring, features such as sentence similarity using word2vec is used to find the cosine similarity of the sentences that is the final score of the response. First, the scores were discovered individually and later combined two features to find a final score. This system gave a remarkable increase in performance by combining the scores.

Mathias and Bhattacharyya ( 2018a ; b ) Proposed Automated Essay Grading Dataset with Essay Attribute Scores. The first concept features selection depends on the essay type. So the common attributes are Content, Organization, Word Choice, Sentence Fluency, Conventions. In this system, each attribute is scored individually, with the strength of each attribute identified. The model they used is a random forest classifier to assign scores to individual attributes. The accuracy they got with QWK is 0.74 for prompt 1 of the ASAS dataset ( https://www.kaggle.com/c/asap-sas/ ).

Ke et al. ( 2019 ) used a support vector machine to find the response score. In this method, features like Agreeability, Specificity, Clarity, Relevance to prompt, Conciseness, Eloquence, Confidence, Direction of development, Justification of opinion, and Justification of importance. First, the individual parameter score obtained was later combined with all scores to give a final response score. The features are used in the neural network to find whether the sentence is relevant to the topic or not.

Salim et al. ( 2019 ) proposed an XGBoost Machine Learning classifier to assess the essays. The algorithm trained on features like word count, POS, parse tree depth, and coherence in the articles with sentence similarity percentage; cohesion and coherence are considered for training. And they implemented K-fold cross-validation for a result the average accuracy after specific validations is 68.12.

3.4.3 Neural network models

Shehab et al. ( 2016 ) proposed a neural network method that used learning vector quantization to train human scored essays. After training, the network can provide a score to the ungraded essays. First, we should process the essay to remove Spell checking and then perform preprocessing steps like Document Tokenization, stop word removal, Stemming, and submit it to the neural network. Finally, the model will provide feedback on the essay, whether it is relevant to the topic. And the correlation coefficient between human rater and system score is 0.7665.

Kopparapu and De ( 2016 ) proposed the Automatic Ranking of Essays using Structural and Semantic Features. This approach constructed a super essay with all the responses. Next, ranking for a student essay is done based on the super-essay. The structural and semantic features derived helps to obtain the scores. In a paragraph, 15 Structural features like an average number of sentences, the average length of sentences, and the count of words, nouns, verbs, adjectives, etc., are used to obtain a syntactic score. A similarity score is used as semantic features to calculate the overall score.

Dong and Zhang ( 2016 ) proposed a hierarchical CNN model. The model builds two layers with word embedding to represents the words as the first layer. The second layer is a word convolution layer with max-pooling to find word vectors. The next layer is a sentence-level convolution layer with max-pooling to find the sentence's content and synonyms. A fully connected dense layer produces an output score for an essay. The accuracy with the hierarchical CNN model resulted in an average QWK of 0.754.

Taghipour and Ng ( 2016 ) proposed a first neural approach for essay scoring build in which convolution and recurrent neural network concepts help in scoring an essay. The network uses a lookup table with the one-hot representation of the word vector of an essay. The final efficiency of the network model with LSTM resulted in an average QWK of 0.708.

Dong et al. ( 2017 ). Proposed an Attention-based scoring system with CNN + LSTM to score an essay. For CNN, the input parameters were character embedding and word embedding, and it has attention pooling layers and used NLTK to obtain word and character embedding. The output gives a sentence vector, which provides sentence weight. After CNN, it will have an LSTM layer with an attention pooling layer, and this final layer results in the final score of the responses. The average QWK score is 0.764.

Riordan et al. ( 2017 ) proposed a neural network with CNN and LSTM layers. Word embedding, given as input to a neural network. An LSTM network layer will retrieve the window features and delivers them to the aggregation layer. The aggregation layer is a superficial layer that takes a correct window of words and gives successive layers to predict the answer's sore. The accuracy of the neural network resulted in a QWK of 0.90.

Zhao et al. ( 2017 ) proposed a new concept called Memory-Augmented Neural network with four layers, input representation layer, memory addressing layer, memory reading layer, and output layer. An input layer represents all essays in a vector form based on essay length. After converting the word vector, the memory addressing layer takes a sample of the essay and weighs all the terms. The memory reading layer takes the input from memory addressing segment and finds the content to finalize the score. Finally, the output layer will provide the final score of the essay. The accuracy of essay scores is 0.78, which is far better than the LSTM neural network.

Mathias and Bhattacharyya ( 2018a ; b ) proposed deep learning networks using LSTM with the CNN layer and GloVe pre-trained word embeddings. For this, they retrieved features like Sentence count essays, word count per sentence, Number of OOVs in the sentence, Language model score, and the text's perplexity. The network predicted the goodness scores of each essay. The higher the goodness scores, means higher the rank and vice versa.

Nguyen and Dery ( 2016 ). Proposed Neural Networks for Automated Essay Grading. In this method, a single layer bi-directional LSTM accepting word vector as input. Glove vectors used in this method resulted in an accuracy of 90%.

Ruseti et al. ( 2018 ) proposed a recurrent neural network that is capable of memorizing the text and generate a summary of an essay. The Bi-GRU network with the max-pooling layer molded on the word embedding of each document. It will provide scoring to the essay by comparing it with a summary of the essay from another Bi-GRU network. The result obtained an accuracy of 0.55.

Wang et al. ( 2018a ; b ) proposed an automatic scoring system with the bi-LSTM recurrent neural network model and retrieved the features using the word2vec technique. This method generated word embeddings from the essay words using the skip-gram model. And later, word embedding is used to train the neural network to find the final score. The softmax layer in LSTM obtains the importance of each word. This method used a QWK score of 0.83%.

Dasgupta et al. ( 2018 ) proposed a technique for essay scoring with augmenting textual qualitative Features. It extracted three types of linguistic, cognitive, and psychological features associated with a text document. The linguistic features are Part of Speech (POS), Universal Dependency relations, Structural Well-formedness, Lexical Diversity, Sentence Cohesion, Causality, and Informativeness of the text. The psychological features derived from the Linguistic Information and Word Count (LIWC) tool. They implemented a convolution recurrent neural network that takes input as word embedding and sentence vector, retrieved from the GloVe word vector. And the second layer is the Convolution Layer to find local features. The next layer is the recurrent neural network (LSTM) to find corresponding of the text. The accuracy of this method resulted in an average QWK of 0.764.

Liang et al. ( 2018 ) proposed a symmetrical neural network AES model with Bi-LSTM. They are extracting features from sample essays and student essays and preparing an embedding layer as input. The embedding layer output is transfer to the convolution layer from that LSTM will be trained. Hear the LSRM model has self-features extraction layer, which will find the essay's coherence. The average QWK score of SBLSTMA is 0.801.

Liu et al. ( 2019 ) proposed two-stage learning. In the first stage, they are assigning a score based on semantic data from the essay. The second stage scoring is based on some handcrafted features like grammar correction, essay length, number of sentences, etc. The average score of the two stages is 0.709.

Pedro Uria Rodriguez et al. ( 2019 ) proposed a sequence-to-sequence learning model for automatic essay scoring. They used BERT (Bidirectional Encoder Representations from Transformers), which extracts the semantics from a sentence from both directions. And XLnet sequence to sequence learning model to extract features like the next sentence in an essay. With this pre-trained model, they attained coherence from the essay to give the final score. The average QWK score of the model is 75.5.

Xia et al. ( 2019 ) proposed a two-layer Bi-directional LSTM neural network for the scoring of essays. The features extracted with word2vec to train the LSTM and accuracy of the model in an average of QWK is 0.870.

Kumar et al. ( 2019 ) Proposed an AutoSAS for short answer scoring. It used pre-trained Word2Vec and Doc2Vec models trained on Google News corpus and Wikipedia dump, respectively, to retrieve the features. First, they tagged every word POS and they found weighted words from the response. It also found prompt overlap to observe how the answer is relevant to the topic, and they defined lexical overlaps like noun overlap, argument overlap, and content overlap. This method used some statistical features like word frequency, difficulty, diversity, number of unique words in each response, type-token ratio, statistics of the sentence, word length, and logical operator-based features. This method uses a random forest model to train the dataset. The data set has sample responses with their associated score. The model will retrieve the features from both responses like graded and ungraded short answers with questions. The accuracy of AutoSAS with QWK is 0.78. It will work on any topics like Science, Arts, Biology, and English.

Jiaqi Lun et al. ( 2020 ) proposed an automatic short answer scoring with BERT. In this with a reference answer comparing student responses and assigning scores. The data augmentation is done with a neural network and with one correct answer from the dataset classifying reaming responses as correct or incorrect.

Zhu and Sun ( 2020 ) proposed a multimodal Machine Learning approach for automated essay scoring. First, they count the grammar score with the spaCy library and numerical count as the number of words and sentences with the same library. With this input, they trained a single and Bi LSTM neural network for finding the final score. For the LSTM model, they prepared sentence vectors with GloVe and word embedding with NLTK. Bi-LSTM will check each sentence in both directions to find semantic from the essay. The average QWK score with multiple models is 0.70.

3.4.4 Ontology based approach

Mohler et al. ( 2011 ) proposed a graph-based method to find semantic similarity in short answer scoring. For the ranking of answers, they used the support vector regression model. The bag of words is the main feature extracted in the system.

Ramachandran et al. ( 2015 ) also proposed a graph-based approach to find lexical based semantics. Identified phrase patterns and text patterns are the features to train a random forest regression model to score the essays. The accuracy of the model in a QWK is 0.78.

Zupanc et al. ( 2017 ) proposed sentence similarity networks to find the essay's score. Ajetunmobi and Daramola ( 2017 ) recommended an ontology-based information extraction approach and domain-based ontology to find the score.

3.4.5 Speech response scoring

Automatic scoring is in two ways one is text-based scoring, other is speech-based scoring. This paper discussed text-based scoring and its challenges, and now we cover speech scoring and common points between text and speech-based scoring. Evanini and Wang ( 2013 ), Worked on speech scoring of non-native school students, extracted features with speech ratter, and trained a linear regression model, concluding that accuracy varies based on voice pitching. Loukina et al. ( 2015 ) worked on feature selection from speech data and trained SVM. Malinin et al. ( 2016 ) used neural network models to train the data. Loukina et al. ( 2017 ). Proposed speech and text-based automatic scoring. Extracted text-based features, speech-based features and trained a deep neural network for speech-based scoring. They extracted 33 types of features based on acoustic signals. Malinin et al. ( 2017 ). Wu Xixin et al. ( 2020 ) Worked on deep neural networks for spoken language assessment. Incorporated different types of models and tested them. Ramanarayanan et al. ( 2017 ) worked on feature extraction methods and extracted punctuation, fluency, and stress and trained different Machine Learning models for scoring. Knill et al. ( 2018 ). Worked on Automatic speech recognizer and its errors how its impacts the speech assessment.

3.4.5.1 The state of the art

This section provides an overview of the existing AES systems with a comparative study w. r. t models, features applied, datasets, and evaluation metrics used for building the automated essay grading systems. We divided all 62 papers into two sets of the first set of review papers in Table 5 with a comparative study of the AES systems.

3.4.6 Comparison of all approaches

In our study, we divided major AES approaches into three categories. Regression models, classification models, and neural network models. The regression models failed to find cohesion and coherence from the essay because it trained on BoW(Bag of Words) features. In processing data from input to output, the regression models are less complicated than neural networks. There are unable to find many intricate patterns from the essay and unable to find sentence connectivity. If we train the model with BoW features in the neural network approach, the model never considers the essay's coherence and coherence.

First, to train a Machine Learning algorithm with essays, all the essays are converted to vector form. We can form a vector with BoW and Word2vec, TF-IDF. The BoW and Word2vec vector representation of essays represented in Table 6 . The vector representation of BoW with TF-IDF is not incorporating the essays semantic, and it’s just statistical learning from a given vector. Word2vec vector comprises semantic of essay in a unidirectional way.

In BoW, the vector contains the frequency of word occurrences in the essay. The vector represents 1 and more based on the happenings of words in the essay and 0 for not present. So, in BoW, the vector does not maintain the relationship with adjacent words; it’s just for single words. In word2vec, the vector represents the relationship between words with other words and sentences prompt in multiple dimensional ways. But word2vec prepares vectors in a unidirectional way, not in a bidirectional way; word2vec fails to find semantic vectors when a word has two meanings, and the meaning depends on adjacent words. Table 7 represents a comparison of Machine Learning models and features extracting methods.

In AES, cohesion and coherence will check the content of the essay concerning the essay prompt these can be extracted from essay in the vector from. Two more parameters are there to access an essay is completeness and feedback. Completeness will check whether student’s response is sufficient or not though the student wrote correctly. Table 8 represents all four parameters comparison for essay grading. Table 9 illustrates comparison of all approaches based on various features like grammar, spelling, organization of essay, relevance.

3.5 What are the challenges/limitations in the current research?

From our study and results discussed in the previous sections, many researchers worked on automated essay scoring systems with numerous techniques. We have statistical methods, classification methods, and neural network approaches to evaluate the essay automatically. The main goal of the automated essay grading system is to reduce human effort and improve consistency.

The vast majority of essay scoring systems are dealing with the efficiency of the algorithm. But there are many challenges in automated essay grading systems. One should assess the essay by following parameters like the relevance of the content to the prompt, development of ideas, Cohesion, Coherence, and domain knowledge.

No model works on the relevance of content, which means whether student response or explanation is relevant to the given prompt or not if it is relevant to how much it is appropriate, and there is no discussion about the cohesion and coherence of the essays. All researches concentrated on extracting the features using some NLP libraries, trained their models, and testing the results. But there is no explanation in the essay evaluation system about consistency and completeness, But Palma and Atkinson ( 2018 ) explained coherence-based essay evaluation. And Zupanc and Bosnic ( 2014 ) also used the word coherence to evaluate essays. And they found consistency with latent semantic analysis (LSA) for finding coherence from essays, but the dictionary meaning of coherence is "The quality of being logical and consistent."

Another limitation is there is no domain knowledge-based evaluation of essays using Machine Learning models. For example, the meaning of a cell is different from biology to physics. Many Machine Learning models extract features with WordVec and GloVec; these NLP libraries cannot convert the words into vectors when they have two or more meanings.

3.5.1 Other challenges that influence the Automated Essay Scoring Systems.

All these approaches worked to improve the QWK score of their models. But QWK will not assess the model in terms of features extraction and constructed irrelevant answers. The QWK is not evaluating models whether the model is correctly assessing the answer or not. There are many challenges concerning students' responses to the Automatic scoring system. Like in evaluating approach, no model has examined how to evaluate the constructed irrelevant and adversarial answers. Especially the black box type of approaches like deep learning models provides more options to the students to bluff the automated scoring systems.

The Machine Learning models that work on statistical features are very vulnerable. Based on Powers et al. ( 2001 ) and Bejar Isaac et al. ( 2014 ), the E-rater was failed on Constructed Irrelevant Responses Strategy (CIRS). From the study of Bejar et al. ( 2013 ), Higgins and Heilman ( 2014 ), observed that when student response contain irrelevant content or shell language concurring to prompt will influence the final score of essays in an automated scoring system.

In deep learning approaches, most of the models automatically read the essay's features, and some methods work on word-based embedding and other character-based embedding features. From the study of Riordan Brain et al. ( 2019 ), The character-based embedding systems do not prioritize spelling correction. However, it is influencing the final score of the essay. From the study of Horbach and Zesch ( 2019 ), Various factors are influencing AES systems. For example, there are data set size, prompt type, answer length, training set, and human scorers for content-based scoring.

Ding et al. ( 2020 ) reviewed that the automated scoring system is vulnerable when a student response contains more words from prompt, like prompt vocabulary repeated in the response. Parekh et al. ( 2020 ) and Kumar et al. ( 2020 ) tested various neural network models of AES by iteratively adding important words, deleting unimportant words, shuffle the words, and repeating sentences in an essay and found that no change in the final score of essays. These neural network models failed to recognize common sense in adversaries' essays and give more options for the students to bluff the automated systems.

Other than NLP and ML techniques for AES. From Wresch ( 1993 ) to Madnani and Cahill ( 2018 ). discussed the complexity of AES systems, standards need to be followed. Like assessment rubrics to test subject knowledge, irrelevant responses, and ethical aspects of an algorithm like measuring the fairness of student response.

Fairness is an essential factor for automated systems. For example, in AES, fairness can be measure in an agreement between human score to machine score. Besides this, From Loukina et al. ( 2019 ), the fairness standards include overall score accuracy, overall score differences, and condition score differences between human and system scores. In addition, scoring different responses in the prospect of constructive relevant and irrelevant will improve fairness.

Madnani et al. ( 2017a ; b ). Discussed the fairness of AES systems for constructed responses and presented RMS open-source tool for detecting biases in the models. With this, one can change fairness standards according to their analysis of fairness.

From Berzak et al.'s ( 2018 ) approach, behavior factors are a significant challenge in automated scoring systems. That helps to find language proficiency, word characteristics (essential words from the text), predict the critical patterns from the text, find related sentences in an essay, and give a more accurate score.

Rupp ( 2018 ), has discussed the designing, evaluating, and deployment methodologies for AES systems. They provided notable characteristics of AES systems for deployment. They are like model performance, evaluation metrics for a model, threshold values, dynamically updated models, and framework.

First, we should check the model performance on different datasets and parameters for operational deployment. Selecting Evaluation metrics for AES models are like QWK, correlation coefficient, or sometimes both. Kelley and Preacher ( 2012 ) have discussed three categories of threshold values: marginal, borderline, and acceptable. The values can be varied based on data size, model performance, type of model (single scoring, multiple scoring models). Once a model is deployed and evaluates millions of responses every time for optimal responses, we need a dynamically updated model based on prompt and data. Finally, framework designing of AES model, hear a framework contains prompts where test-takers can write the responses. One can design two frameworks: a single scoring model for a single methodology and multiple scoring models for multiple concepts. When we deploy multiple scoring models, each prompt could be trained separately, or we can provide generalized models for all prompts with this accuracy may vary, and it is challenging.

4 Synthesis

Our Systematic literature review on the automated essay grading system first collected 542 papers with selected keywords from various databases. After inclusion and exclusion criteria, we left with 139 articles; on these selected papers, we applied Quality assessment criteria with two reviewers, and finally, we selected 62 writings for final review.

Our observations on automated essay grading systems from 2010 to 2020 are as followed:

The implementation techniques of automated essay grading systems are classified into four buckets; there are 1. regression models 2. Classification models 3. Neural networks 4. Ontology-based methodology, but using neural networks, the researchers are more accurate than other techniques, and all the methods state of the art provided in Table 3 .

The majority of the regression and classification models on essay scoring used statistical features to find the final score. It means the systems or models trained on such parameters as word count, sentence count, etc. though the parameters extracted from the essay, the algorithm are not directly training on essays. The algorithms trained on some numbers obtained from the essay and hear if numbers matched the composition will get a good score; otherwise, the rating is less. In these models, the evaluation process is entirely on numbers, irrespective of the essay. So, there is a lot of chance to miss the coherence, relevance of the essay if we train our algorithm on statistical parameters.

In the neural network approach, the models trained on Bag of Words (BoW) features. The BoW feature is missing the relationship between a word to word and the semantic meaning of the sentence. E.g., Sentence 1: John killed bob. Sentence 2: bob killed John. In these two sentences, the BoW is "John," "killed," "bob."

In the Word2Vec library, if we are prepared a word vector from an essay in a unidirectional way, the vector will have a dependency with other words and finds the semantic relationship with other words. But if a word has two or more meanings like "Bank loan" and "River Bank," hear bank has two implications, and its adjacent words decide the sentence meaning; in this case, Word2Vec is not finding the real meaning of the word from the sentence.

The features extracted from essays in the essay scoring system are classified into 3 type's features like statistical features, style-based features, and content-based features, which are explained in RQ2 and Table 3 . But statistical features, are playing a significant role in some systems and negligible in some systems. In Shehab et al. ( 2016 ); Cummins et al. ( 2016 ). Dong et al. ( 2017 ). Dong and Zhang ( 2016 ). Mathias and Bhattacharyya ( 2018a ; b ) Systems the assessment is entirely on statistical and style-based features they have not retrieved any content-based features. And in other systems that extract content from the essays, the role of statistical features is for only preprocessing essays but not included in the final grading.

In AES systems, coherence is the main feature to be considered while evaluating essays. The actual meaning of coherence is to stick together. That is the logical connection of sentences (local level coherence) and paragraphs (global level coherence) in a story. Without coherence, all sentences in a paragraph are independent and meaningless. In an Essay, coherence is a significant feature that is explaining everything in a flow and its meaning. It is a powerful feature in AES system to find the semantics of essay. With coherence, one can assess whether all sentences are connected in a flow and all paragraphs are related to justify the prompt. Retrieving the coherence level from an essay is a critical task for all researchers in AES systems.

In automatic essay grading systems, the assessment of essays concerning content is critical. That will give the actual score for the student. Most of the researches used statistical features like sentence length, word count, number of sentences, etc. But according to collected results, 32% of the systems used content-based features for the essay scoring. Example papers which are on content-based assessment are Taghipour and Ng ( 2016 ); Persing and Ng ( 2013 ); Wang et al. ( 2018a , 2018b ); Zhao et al. ( 2017 ); Kopparapu and De ( 2016 ), Kumar et al. ( 2019 ); Mathias and Bhattacharyya ( 2018a ; b ); Mohler and Mihalcea ( 2009 ) are used content and statistical-based features. The results are shown in Fig. 3 . And mainly the content-based features extracted with word2vec NLP library, but word2vec is capable of capturing the context of a word in a document, semantic and syntactic similarity, relation with other terms, but word2vec is capable of capturing the context word in a uni-direction either left or right. If a word has multiple meanings, there is a chance of missing the context in the essay. After analyzing all the papers, we found that content-based assessment is a qualitative assessment of essays.

On the other hand, Horbach and Zesch ( 2019 ); Riordan Brain et al. ( 2019 ); Ding et al. ( 2020 ); Kumar et al. ( 2020 ) proved that neural network models are vulnerable when a student response contains constructed irrelevant, adversarial answers. And a student can easily bluff an automated scoring system by submitting different responses like repeating sentences and repeating prompt words in an essay. From Loukina et al. ( 2019 ), and Madnani et al. ( 2017b ). The fairness of an algorithm is an essential factor to be considered in AES systems.

While talking about speech assessment, the data set contains audios of duration up to one minute. Feature extraction techniques are entirely different from text assessment, and accuracy varies based on speaking fluency, pitching, male to female voice and boy to adult voice. But the training algorithms are the same for text and speech assessment.

Once an AES system evaluates essays and short answers accurately in all directions, there is a massive demand for automated systems in the educational and related world. Now AES systems are deployed in GRE, TOEFL exams; other than these, we can deploy AES systems in massive open online courses like Coursera(“ https://coursera.org/learn//machine-learning//exam ”), NPTEL ( https://swayam.gov.in/explorer ), etc. still they are assessing student performance with multiple-choice questions. In another perspective, AES systems can be deployed in information retrieval systems like Quora, stack overflow, etc., to check whether the retrieved response is appropriate to the question or not and can give ranking to the retrieved answers.

5 Conclusion and future work

As per our Systematic literature review, we studied 62 papers. There exist significant challenges for researchers in implementing automated essay grading systems. Several researchers are working rigorously on building a robust AES system despite its difficulty in solving this problem. All evaluating methods are not evaluated based on coherence, relevance, completeness, feedback, and knowledge-based. And 90% of essay grading systems are used Kaggle ASAP (2012) dataset, which has general essays from students and not required any domain knowledge, so there is a need for domain-specific essay datasets to train and test. Feature extraction is with NLTK, WordVec, and GloVec NLP libraries; these libraries have many limitations while converting a sentence into vector form. Apart from feature extraction and training Machine Learning models, no system is accessing the essay's completeness. No system provides feedback to the student response and not retrieving coherence vectors from the essay—another perspective the constructive irrelevant and adversarial student responses still questioning AES systems.

Our proposed research work will go on the content-based assessment of essays with domain knowledge and find a score for the essays with internal and external consistency. And we will create a new dataset concerning one domain. And another area in which we can improve is the feature extraction techniques.

This study includes only four digital databases for study selection may miss some functional studies on the topic. However, we hope that we covered most of the significant studies as we manually collected some papers published in useful journals.

Adamson, A., Lamb, A., & December, R. M. (2014). Automated Essay Grading.

Ajay HB, Tillett PI, Page EB (1973) Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development

Ajetunmobi SA, Daramola O (2017) Ontology-based information extraction for subject-focussed automatic essay evaluation. In: 2017 International Conference on Computing Networking and Informatics (ICCNI) p 1–6. IEEE

Alva-Manchego F, et al. (2019) EASSE: Easier Automatic Sentence Simplification Evaluation.” ArXiv abs/1908.04567 (2019): n. pag

Bailey S, Meurers D (2008) Diagnosing meaning errors in short answers to reading comprehension questions. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (Columbus), p 107–115

Basu S, Jacobs C, Vanderwende L (2013) Powergrading: a clustering approach to amplify human effort for short answer grading. Trans Assoc Comput Linguist (TACL) 1:391–402

Article   Google Scholar  

Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing, 22, 48-59.

Bejar I, et al. (2013) Length of Textual Response as a Construct-Irrelevant Response Strategy: The Case of Shell Language. Research Report. ETS RR-13-07.” ETS Research Report Series (2013): n. pag

Berzak Y, et al. (2018) “Assessing Language Proficiency from Eye Movements in Reading.” ArXiv abs/1804.07329 (2018): n. pag

Blanchard D, Tetreault J, Higgins D, Cahill A, Chodorow M (2013) TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15, 2013

Blood, I. (2011). Automated essay scoring: a literature review. Studies in Applied Linguistics and TESOL, 11(2).

Burrows S, Gurevych I, Stein B (2015) The eras and trends of automatic short answer grading. Int J Artif Intell Educ 25:60–117. https://doi.org/10.1007/s40593-014-0026-8

Cader, A. (2020, July). The Potential for the Use of Deep Neural Networks in e-Learning Student Evaluation with New Data Augmentation Method. In International Conference on Artificial Intelligence in Education (pp. 37–42). Springer, Cham.

Cai C (2019) Automatic essay scoring with recurrent neural network. In: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications (2019): n. pag.

Chen M, Li X (2018) "Relevance-Based Automated Essay Scoring via Hierarchical Recurrent Model. In: 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 2018, p 378–383, doi: https://doi.org/10.1109/IALP.2018.8629256

Chen Z, Zhou Y (2019) "Research on Automatic Essay Scoring of Composition Based on CNN and OR. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, p 13–18, doi: https://doi.org/10.1109/ICAIBD.2019.8837007

Contreras JO, Hilles SM, Abubakar ZB (2018) Automated essay scoring with ontology based on text mining and NLTK tools. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 1-6

Correnti R, Matsumura LC, Hamilton L, Wang E (2013) Assessing students’ skills at writing analytically in response to texts. Elem Sch J 114(2):142–177

Cummins, R., Zhang, M., & Briscoe, E. (2016, August). Constrained multi-task learning for automated essay scoring. Association for Computational Linguistics.

Darwish SM, Mohamed SK (2020) Automated essay evaluation based on fusion of fuzzy ontology and latent semantic analysis. In: Hassanien A, Azar A, Gaber T, Bhatnagar RF, Tolba M (eds) The International Conference on Advanced Machine Learning Technologies and Applications

Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 93–102

Ding Y, et al. (2020) "Don’t take “nswvtnvakgxpm” for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input." In: Proceedings of the 28th International Conference on Computational Linguistics

Dong F, Zhang Y (2016) Automatic features for essay scoring–an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1072–1077

Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) p 153–162

Dzikovska M, Nielsen R, Brew C, Leacock C, Gi ampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013a) Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge

Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Trang Dang H (2013b) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. *SEM 2013: The First Joint Conference on Lexical and Computational Semantics

Educational Testing Service (2008) CriterionSM online writing evaluation service. Retrieved from http://www.ets.org/s/criterion/pdf/9286_CriterionBrochure.pdf .

Evanini, K., & Wang, X. (2013, August). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).

Foltz PW, Laham D, Landauer TK (1999) The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1, 2, http://imej.wfu.edu/articles/1999/2/04/ index.asp

Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (Eds.). (2009). International corpus of learner English. Louvain-la-Neuve: Presses universitaires de Louvain.

Higgins, D., & Heilman, M. (2014). Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior. Educational Measurement: Issues and Practice, 33(3), 36–46.

Horbach A, Zesch T (2019) The influence of variance in learner answers on automatic content scoring. Front Educ 4:28. https://doi.org/10.3389/feduc.2019.00028

https://www.coursera.org/learn/machine-learning/exam/7pytE/linear-regression-with-multiple-variables/attempt

Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208.

Ke Z, Ng V (2019) “Automated essay scoring: a survey of the state of the art.” IJCAI

Ke, Z., Inamdar, H., Lin, H., & Ng, V. (2019, July). Give me more feedback II: Annotating thesis strength and related attributes in student essays. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3994-4004).

Kelley K, Preacher KJ (2012) On effect size. Psychol Methods 17(2):137–152

Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering–a systematic literature review. Inf Softw Technol 51(1):7–15

Klebanov, B. B., & Madnani, N. (2020, July). Automated evaluation of writing–50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7796–7810).

Knill K, Gales M, Kyriakopoulos K, et al. (4 more authors) (2018) Impact of ASR performance on free speaking language assessment. In: Interspeech 2018.02–06 Sep 2018, Hyderabad, India. International Speech Communication Association (ISCA)

Kopparapu SK, De A (2016) Automatic ranking of essays using structural and semantic features. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), p 519–523

Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019, July). Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 9662–9669).

Kumar Y, et al. (2020) “Calling out bluff: attacking the robustness of automatic scoring systems with simple adversarial testing.” ArXiv abs/2007.06796

Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-Based Automated Essay Scoring Using Self-attention. In: Sun M, Liu T, Wang X, Liu Z, Liu Y (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2018, NLP-NABD 2018. Lecture Notes in Computer Science, vol 11221. Springer, Cham. https://doi.org/10.1007/978-3-030-01716-3_32

Liang G, On B, Jeong D, Kim H, Choi G (2018) Automated essay scoring: a siamese bidirectional LSTM neural network architecture. Symmetry 10:682

Liua, H., Yeb, Y., & Wu, M. (2018, April). Ensemble Learning on Scoring Student Essay. In 2018 International Conference on Management and Education, Humanities and Social Sciences (MEHSS 2018). Atlantis Press.

Liu J, Xu Y, Zhao L (2019) Automated Essay Scoring based on Two-Stage Learning. ArXiv, abs/1901.07744

Loukina A, et al. (2015) Feature selection for automated speech scoring.” BEA@NAACL-HLT

Loukina A, et al. (2017) “Speech- and Text-driven Features for Automated Scoring of English-Speaking Tasks.” SCNLP@EMNLP 2017

Loukina A, et al. (2019) The many dimensions of algorithmic fairness in educational applications. BEA@ACL

Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(09): 13389-13396

Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).

Madnani N, et al. (2017b) “Building better open-source tools to support fairness in automated scoring.” EthNLP@EACL

Malinin A, et al. (2016) “Off-topic response detection for spontaneous spoken english assessment.” ACL

Malinin A, et al. (2017) “Incorporating uncertainty into deep learning for spoken language assessment.” ACL

Mathias S, Bhattacharyya P (2018a) Thank “Goodness”! A Way to Measure Style in Student Essays. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 35–41

Mathias S, Bhattacharyya P (2018b) ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Mikolov T, et al. (2013) “Efficient Estimation of Word Representations in Vector Space.” ICLR

Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) p 567–575

Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies p 752–762

Muangkammuen P, Fukumoto F (2020) Multi-task Learning for Automated Essay Scoring with Sentiment Analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop p 116–123

Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1–11.

Palma D, Atkinson J (2018) Coherence-based automatic essay assessment. IEEE Intell Syst 33(5):26–36

Parekh S, et al (2020) My Teacher Thinks the World Is Flat! Interpreting Automatic Essay Scoring Mechanism.” ArXiv abs/2012.13872 (2020): n. pag

Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

Persing I, Ng V (2013) Modeling thesis clarity in student essays. In:Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) p 260–269

Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K (2001) Stumping E-Rater: challenging the validity of automated essay scoring. ETS Res Rep Ser 2001(1):i–44

Google Scholar  

Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002). Stumping e-rater: challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134.

Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications p 97–106

Ramanarayanan V, et al. (2017) “Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human-Machine Spoken Dialog Interactions.” INTERSPEECH

Riordan B, Horbach A, Cahill A, Zesch T, Lee C (2017) Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications p 159–168

Riordan B, Flor M, Pugh R (2019) "How to account for misspellings: Quantifying the benefit of character representations in neural content scoring models."In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Rodriguez P, Jafari A, Ormerod CM (2019) Language models and Automated Essay Scoring. ArXiv, abs/1909.09482

Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2).

Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).

Rupp A (2018) Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl Meas Educ 31:191–214

Ruseti S, Dascalu M, Johnson AM, McNamara DS, Balyan R, McCarthy KS, Trausan-Matu S (2018) Scoring summaries using recurrent neural networks. In: International Conference on Intelligent Tutoring Systems p 191–201. Springer, Cham

Sakaguchi K, Heilman M, Madnani N (2015) Effective feature integration for automated short answer scoring. In: Proceedings of the 2015 conference of the North American Chapter of the association for computational linguistics: Human language technologies p 1049–1054

Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., & Suhartono, D. (2019, December). Automated English Digital Essay Grader Using Machine Learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) (pp. 1–6). IEEE.

Shehab A, Elhoseny M, Hassanien AE (2016) A hybrid scheme for Automated Essay Grading based on LVQ and NLP techniques. In: 12th International Computer Engineering Conference (ICENCO), Cairo, 2016, p 65-70

Shermis MD, Mzumara HR, Olson J, Harrington S (2001) On-line grading of student essays: PEG goes on the World Wide Web. Assess Eval High Educ 26(3):247–259

Stab C, Gurevych I (2014) Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) p 46–56

Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1070–1075

Süzen, N., Gorban, A. N., Levesley, J., & Mirkes, E. M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743.

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the 2016 conference on empirical methods in natural language processing p 1882–1891

Tashu TM (2020) "Off-Topic Essay Detection Using C-BGRU Siamese. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, p 221–225, doi: https://doi.org/10.1109/ICSC.2020.00046

Tashu TM, Horváth T (2019) A layered approach to automatic essay evaluation using word-embedding. In: McLaren B, Reilly R, Zvacek S, Uhomoibhi J (eds) Computer Supported Education. CSEDU 2018. Communications in Computer and Information Science, vol 1022. Springer, Cham

Tashu TM, Horváth T (2020) Semantic-Based Feedback Recommendation for Automatic Essay Evaluation. In: Bi Y, Bhatia R, Kapoor S (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham

Uto M, Okano M (2020) Robust Neural Automated Essay Scoring Using Item Response Theory. In: Bittencourt I, Cukurova M, Muldner K, Luckin R, Millán E (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham

Wang Z, Liu J, Dong R (2018a) Intelligent Auto-grading System. In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) p 430–435. IEEE.

Wang Y, et al. (2018b) “Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning.” EMNLP

Zhu W, Sun Y (2020) Automated essay scoring system using multi-model Machine Learning, david c. wyld et al. (eds): mlnlp, bdiot, itccma, csity, dtmn, aifz, sigpro

Wresch W (1993) The Imminence of Grading Essays by Computer-25 Years Later. Comput Compos 10:45–58

Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment.

Xia L, Liu J, Zhang Z (2019) Automatic Essay Scoring Model Based on Two-Layer Bi-directional Long-Short Term Memory Network. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence p 133–137

Yannakoudakis H, Briscoe T, Medlock B (2011) A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies p 180–189

Zhao S, Zhang Y, Xiong X, Botelho A, Heffernan N (2017) A memory-augmented neural model for automated grading. In: Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale p 189–192

Zupanc K, Bosnic Z (2014) Automated essay evaluation augmented with semantic coherence measures. In: 2014 IEEE International Conference on Data Mining p 1133–1138. IEEE.

Zupanc K, Savić M, Bosnić Z, Ivanović M (2017) Evaluating coherence of essays using sentence-similarity networks. In: Proceedings of the 18th International Conference on Computer Systems and Technologies p 65–72

Dzikovska, M. O., Nielsen, R., & Brew, C. (2012, June). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In  Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  (pp. 200-210).

Kumar, N., & Dey, L. (2013, November). Automatic Quality Assessment of documents with application to essay grading. In 2013 12th Mexican International Conference on Artificial Intelligence (pp. 216–222). IEEE.

Wu, S. H., & Shih, W. F. (2018, July). A short answer grading system in chinese by support vector approach. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications (pp. 125-129).

Agung Putri Ratna, A., Lalita Luhurkinanti, D., Ibrahim I., Husna D., Dewi Purnamasari P. (2018). Automatic Essay Grading System for Japanese Language Examination Using Winnowing Algorithm, 2018 International Seminar on Application for Technology of Information and Communication, 2018, pp. 565–569. https://doi.org/10.1109/ISEMANTIC.2018.8549789 .

Sharma A., & Jayagopi D. B. (2018). Automated Grading of Handwritten Essays 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp 279–284. https://doi.org/10.1109/ICFHR-2018.2018.00056

Download references

Not Applicable.

Author information

Authors and affiliations.

School of Computer Science and Artificial Intelligence, SR University, Warangal, TS, India

Dadi Ramesh

Research Scholar, JNTU, Hyderabad, India

Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS, India

Suresh Kumar Sanampudi

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Dadi Ramesh .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 80 KB)

Rights and permissions.

Reprints and permissions

About this article

Ramesh, D., Sanampudi, S.K. An automated essay scoring systems: a systematic literature review. Artif Intell Rev 55 , 2495–2527 (2022). https://doi.org/10.1007/s10462-021-10068-2

Download citation

Published : 23 September 2021

Issue Date : March 2022

DOI : https://doi.org/10.1007/s10462-021-10068-2

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Short answer scoring
  • Essay grading
  • Natural language processing
  • Deep learning
  • Find a journal
  • Publish with us
  • Track your research

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings
  • Advanced Search
  • Journal List
  • Springer Nature - PMC COVID-19 Collection

Logo of phenaturepg

An automated essay scoring systems: a systematic literature review

Dadi ramesh.

1 School of Computer Science and Artificial Intelligence, SR University, Warangal, TS India

2 Research Scholar, JNTU, Hyderabad, India

Suresh Kumar Sanampudi

3 Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS India

Associated Data

Assessment in the Education system plays a significant role in judging student performance. The present evaluation system is through human assessment. As the number of teachers' student ratio is gradually increasing, the manual evaluation process becomes complicated. The drawback of manual evaluation is that it is time-consuming, lacks reliability, and many more. This connection online examination system evolved as an alternative tool for pen and paper-based methods. Present Computer-based evaluation system works only for multiple-choice questions, but there is no proper evaluation system for grading essays and short answers. Many researchers are working on automated essay grading and short answer scoring for the last few decades, but assessing an essay by considering all parameters like the relevance of the content to the prompt, development of ideas, Cohesion, and Coherence is a big challenge till now. Few researchers focused on Content-based evaluation, while many of them addressed style-based assessment. This paper provides a systematic literature review on automated essay scoring systems. We studied the Artificial Intelligence and Machine Learning techniques used to evaluate automatic essay scoring and analyzed the limitations of the current studies and research trends. We observed that the essay evaluation is not done based on the relevance of the content and coherence.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10462-021-10068-2.

Introduction

Due to COVID 19 outbreak, an online educational system has become inevitable. In the present scenario, almost all the educational institutions ranging from schools to colleges adapt the online education system. The assessment plays a significant role in measuring the learning ability of the student. Most automated evaluation is available for multiple-choice questions, but assessing short and essay answers remain a challenge. The education system is changing its shift to online-mode, like conducting computer-based exams and automatic evaluation. It is a crucial application related to the education domain, which uses natural language processing (NLP) and Machine Learning techniques. The evaluation of essays is impossible with simple programming languages and simple techniques like pattern matching and language processing. Here the problem is for a single question, we will get more responses from students with a different explanation. So, we need to evaluate all the answers concerning the question.

Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades the student responses by considering appropriate features. The AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. ( 1973 ). PEG evaluates the writing characteristics such as grammar, diction, construction, etc., to grade the essay. A modified version of the PEG by Shermis et al. ( 2001 ) was released, which focuses on grammar checking with a correlation between human evaluators and the system. Foltz et al. ( 1999 ) introduced an Intelligent Essay Assessor (IEA) by evaluating content using latent semantic analysis to produce an overall score. Powers et al. ( 2002 ) proposed E-rater and Intellimetric by Rudner et al. ( 2006 ) and Bayesian Essay Test Scoring System (BESTY) by Rudner and Liang ( 2002 ), these systems use natural language processing (NLP) techniques that focus on style and content to obtain the score of an essay. The vast majority of the essay scoring systems in the 1990s followed traditional approaches like pattern matching and a statistical-based approach. Since the last decade, the essay grading systems started using regression-based and natural language processing techniques. AES systems like Dong et al. ( 2017 ) and others developed from 2014 used deep learning techniques, inducing syntactic and semantic features resulting in better results than earlier systems.

Ohio, Utah, and most US states are using AES systems in school education, like Utah compose tool, Ohio standardized test (an updated version of PEG), evaluating millions of student's responses every year. These systems work for both formative, summative assessments and give feedback to students on the essay. Utah provided basic essay evaluation rubrics (six characteristics of essay writing): Development of ideas, organization, style, word choice, sentence fluency, conventions. Educational Testing Service (ETS) has been conducting significant research on AES for more than a decade and designed an algorithm to evaluate essays on different domains and providing an opportunity for test-takers to improve their writing skills. In addition, they are current research content-based evaluation.

The evaluation of essay and short answer scoring should consider the relevance of the content to the prompt, development of ideas, Cohesion, Coherence, and domain knowledge. Proper assessment of the parameters mentioned above defines the accuracy of the evaluation system. But all these parameters cannot play an equal role in essay scoring and short answer scoring. In a short answer evaluation, domain knowledge is required, like the meaning of "cell" in physics and biology is different. And while evaluating essays, the implementation of ideas with respect to prompt is required. The system should also assess the completeness of the responses and provide feedback.

Several studies examined AES systems, from the initial to the latest AES systems. In which the following studies on AES systems are Blood ( 2011 ) provided a literature review from PEG 1984–2010. Which has covered only generalized parts of AES systems like ethical aspects, the performance of the systems. Still, they have not covered the implementation part, and it’s not a comparative study and has not discussed the actual challenges of AES systems.

Burrows et al. ( 2015 ) Reviewed AES systems on six dimensions like dataset, NLP techniques, model building, grading models, evaluation, and effectiveness of the model. They have not covered feature extraction techniques and challenges in features extractions. Covered only Machine Learning models but not in detail. This system not covered the comparative analysis of AES systems like feature extraction, model building, and level of relevance, cohesion, and coherence not covered in this review.

Ke et al. ( 2019 ) provided a state of the art of AES system but covered very few papers and not listed all challenges, and no comparative study of the AES model. On the other hand, Hussein et al. in ( 2019 ) studied two categories of AES systems, four papers from handcrafted features for AES systems, and four papers from the neural networks approach, discussed few challenges, and did not cover feature extraction techniques, the performance of AES models in detail.

Klebanov et al. ( 2020 ). Reviewed 50 years of AES systems, listed and categorized all essential features that need to be extracted from essays. But not provided a comparative analysis of all work and not discussed the challenges.

This paper aims to provide a systematic literature review (SLR) on automated essay grading systems. An SLR is an Evidence-based systematic review to summarize the existing research. It critically evaluates and integrates all relevant studies' findings and addresses the research domain's specific research questions. Our research methodology uses guidelines given by Kitchenham et al. ( 2009 ) for conducting the review process; provide a well-defined approach to identify gaps in current research and to suggest further investigation.

We addressed our research method, research questions, and the selection process in Sect.  2 , and the results of the research questions have discussed in Sect.  3 . And the synthesis of all the research questions addressed in Sect.  4 . Conclusion and possible future work discussed in Sect.  5 .

Research method

We framed the research questions with PICOC criteria.

Population (P) Student essays and answers evaluation systems.

Intervention (I) evaluation techniques, data sets, features extraction methods.

Comparison (C) Comparison of various approaches and results.

Outcomes (O) Estimate the accuracy of AES systems,

Context (C) NA.

Research questions

To collect and provide research evidence from the available studies in the domain of automated essay grading, we framed the following research questions (RQ):

RQ1 what are the datasets available for research on automated essay grading?

The answer to the question can provide a list of the available datasets, their domain, and access to the datasets. It also provides a number of essays and corresponding prompts.

RQ2 what are the features extracted for the assessment of essays?

The answer to the question can provide an insight into various features so far extracted, and the libraries used to extract those features.

RQ3, which are the evaluation metrics available for measuring the accuracy of algorithms?

The answer will provide different evaluation metrics for accurate measurement of each Machine Learning approach and commonly used measurement technique.

RQ4 What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?

It can provide insights into various Machine Learning techniques like regression models, classification models, and neural networks for implementing essay grading systems. The response to the question can give us different assessment approaches for automated essay grading systems.

RQ5 What are the challenges/limitations in the current research?

The answer to the question provides limitations of existing research approaches like cohesion, coherence, completeness, and feedback.

Search process

We conducted an automated search on well-known computer science repositories like ACL, ACM, IEEE Explore, Springer, and Science Direct for an SLR. We referred to papers published from 2010 to 2020 as much of the work during these years focused on advanced technologies like deep learning and natural language processing for automated essay grading systems. Also, the availability of free data sets like Kaggle (2012), Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. ( 2011 ) led to research this domain.

Search Strings : We used search strings like “Automated essay grading” OR “Automated essay scoring” OR “short answer scoring systems” OR “essay scoring systems” OR “automatic essay evaluation” and searched on metadata.

Selection criteria

After collecting all relevant documents from the repositories, we prepared selection criteria for inclusion and exclusion of documents. With the inclusion and exclusion criteria, it becomes more feasible for the research to be accurate and specific.

Inclusion criteria 1 Our approach is to work with datasets comprise of essays written in English. We excluded the essays written in other languages.

Inclusion criteria 2  We included the papers implemented on the AI approach and excluded the traditional methods for the review.

Inclusion criteria 3 The study is on essay scoring systems, so we exclusively included the research carried out on only text data sets rather than other datasets like image or speech.

Exclusion criteria  We removed the papers in the form of review papers, survey papers, and state of the art papers.

Quality assessment

In addition to the inclusion and exclusion criteria, we assessed each paper by quality assessment questions to ensure the article's quality. We included the documents that have clearly explained the approach they used, the result analysis and validation.

The quality checklist questions are framed based on the guidelines from Kitchenham et al. ( 2009 ). Each quality assessment question was graded as either 1 or 0. The final score of the study range from 0 to 3. A cut off score for excluding a study from the review is 2 points. Since the papers scored 2 or 3 points are included in the final evaluation. We framed the following quality assessment questions for the final study.

Quality Assessment 1: Internal validity.

Quality Assessment 2: External validity.

Quality Assessment 3: Bias.

The two reviewers review each paper to select the final list of documents. We used the Quadratic Weighted Kappa score to measure the final agreement between the two reviewers. The average resulted from the kappa score is 0.6942, a substantial agreement between the reviewers. The result of evolution criteria shown in Table ​ Table1. 1 . After Quality Assessment, the final list of papers for review is shown in Table ​ Table2. 2 . The complete selection process is shown in Fig. ​ Fig.1. 1 . The total number of selected papers in year wise as shown in Fig. ​ Fig.2. 2 .

Quality assessment analysis

Final list of papers

An external file that holds a picture, illustration, etc.
Object name is 10462_2021_10068_Fig1_HTML.jpg

Selection process

An external file that holds a picture, illustration, etc.
Object name is 10462_2021_10068_Fig2_HTML.jpg

Year wise publications

What are the datasets available for research on automated essay grading?

To work with problem statement especially in Machine Learning and deep learning domain, we require considerable amount of data to train the models. To answer this question, we listed all the data sets used for training and testing for automated essay grading systems. The Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) Yannakoudakis et al. ( 2011 ) developed corpora that contain 1244 essays and ten prompts. This corpus evaluates whether a student can write the relevant English sentences without any grammatical and spelling mistakes. This type of corpus helps to test the models built for GRE and TOFEL type of exams. It gives scores between 1 and 40.

Bailey and Meurers ( 2008 ), Created a dataset (CREE reading comprehension) for language learners and automated short answer scoring systems. The corpus consists of 566 responses from intermediate students. Mohler and Mihalcea ( 2009 ). Created a dataset for the computer science domain consists of 630 responses for data structure assignment questions. The scores are range from 0 to 5 given by two human raters.

Dzikovska et al. ( 2012 ) created a Student Response Analysis (SRA) corpus. It consists of two sub-groups: the BEETLE corpus consists of 56 questions and approximately 3000 responses from students in the electrical and electronics domain. The second one is the SCIENTSBANK(SemEval-2013) (Dzikovska et al. 2013a ; b ) corpus consists of 10,000 responses on 197 prompts on various science domains. The student responses ladled with "correct, partially correct incomplete, Contradictory, Irrelevant, Non-domain."

In the Kaggle (2012) competition, released total 3 types of corpuses on an Automated Student Assessment Prize (ASAP1) (“ https://www.kaggle.com/c/asap-sas/ ” ) essays and short answers. It has nearly 17,450 essays, out of which it provides up to 3000 essays for each prompt. It has eight prompts that test 7th to 10th grade US students. It gives scores between the [0–3] and [0–60] range. The limitations of these corpora are: (1) it has a different score range for other prompts. (2) It uses statistical features such as named entities extraction and lexical features of words to evaluate essays. ASAP +  + is one more dataset from Kaggle. It is with six prompts, and each prompt has more than 1000 responses total of 10,696 from 8th-grade students. Another corpus contains ten prompts from science, English domains and a total of 17,207 responses. Two human graders evaluated all these responses.

Correnti et al. ( 2013 ) created a Response-to-Text Assessment (RTA) dataset used to check student writing skills in all directions like style, mechanism, and organization. 4–8 grade students give the responses to RTA. Basu et al. ( 2013 ) created a power grading dataset with 700 responses for ten different prompts from US immigration exams. It contains all short answers for assessment.

The TOEFL11 corpus Blanchard et al. ( 2013 ) contains 1100 essays evenly distributed over eight prompts. It is used to test the English language skills of a candidate attending the TOFEL exam. It scores the language proficiency of a candidate as low, medium, and high.

International Corpus of Learner English (ICLE) Granger et al. ( 2009 ) built a corpus of 3663 essays covering different dimensions. It has 12 prompts with 1003 essays that test the organizational skill of essay writing, and13 prompts, each with 830 essays that examine the thesis clarity and prompt adherence.

Argument Annotated Essays (AAE) Stab and Gurevych ( 2014 ) developed a corpus that contains 102 essays with 101 prompts taken from the essayforum2 site. It tests the persuasive nature of the student essay. The SCIENTSBANK corpus used by Sakaguchi et al. ( 2015 ) available in git-hub, containing 9804 answers to 197 questions in 15 science domains. Table ​ Table3 3 illustrates all datasets related to AES systems.

ALL types Datasets used in Automatic scoring systems

Features play a major role in the neural network and other supervised Machine Learning approaches. The automatic essay grading systems scores student essays based on different types of features, which play a prominent role in training the models. Based on their syntax and semantics and they are categorized into three groups. 1. statistical-based features Contreras et al. ( 2018 ); Kumar et al. ( 2019 ); Mathias and Bhattacharyya ( 2018a ; b ) 2. Style-based (Syntax) features Cummins et al. ( 2016 ); Darwish and Mohamed ( 2020 ); Ke et al. ( 2019 ). 3. Content-based features Dong et al. ( 2017 ). A good set of features appropriate models evolved better AES systems. The vast majority of the researchers are using regression models if features are statistical-based. For Neural Networks models, researches are using both style-based and content-based features. The following table shows the list of various features used in existing AES Systems. Table ​ Table4 4 represents all set of features used for essay grading.

Types of features

We studied all the feature extracting NLP libraries as shown in Fig. ​ Fig.3. that 3 . that are used in the papers. The NLTK is an NLP tool used to retrieve statistical features like POS, word count, sentence count, etc. With NLTK, we can miss the essay's semantic features. To find semantic features Word2Vec Mikolov et al. ( 2013 ), GloVe Jeffrey Pennington et al. ( 2014 ) is the most used libraries to retrieve the semantic text from the essays. And in some systems, they directly trained the model with word embeddings to find the score. From Fig. ​ Fig.4 4 as observed that non-content-based feature extraction is higher than content-based.

An external file that holds a picture, illustration, etc.
Object name is 10462_2021_10068_Fig3_HTML.jpg

Usages of tools

An external file that holds a picture, illustration, etc.
Object name is 10462_2021_10068_Fig4_HTML.jpg

Number of papers on content based features

RQ3 which are the evaluation metrics available for measuring the accuracy of algorithms?

The majority of the AES systems are using three evaluation metrics. They are (1) quadrated weighted kappa (QWK) (2) Mean Absolute Error (MAE) (3) Pearson Correlation Coefficient (PCC) Shehab et al. ( 2016 ). The quadratic weighted kappa will find agreement between human evaluation score and system evaluation score and produces value ranging from 0 to 1. And the Mean Absolute Error is the actual difference between human-rated score to system-generated score. The mean square error (MSE) measures the average squares of the errors, i.e., the average squared difference between the human-rated and the system-generated scores. MSE will always give positive numbers only. Pearson's Correlation Coefficient (PCC) finds the correlation coefficient between two variables. It will provide three values (0, 1, − 1). "0" represents human-rated and system scores that are not related. "1" represents an increase in the two scores. "− 1" illustrates a negative relationship between the two scores.

RQ4 what are the Machine Learning techniques being used for automatic essay grading, and how are they implemented?

After scrutinizing all documents, we categorize the techniques used in automated essay grading systems into four baskets. 1. Regression techniques. 2. Classification model. 3. Neural networks. 4. Ontology-based approach.

All the existing AES systems developed in the last ten years employ supervised learning techniques. Researchers using supervised methods viewed the AES system as either regression or classification task. The goal of the regression task is to predict the score of an essay. The classification task is to classify the essays belonging to (low, medium, or highly) relevant to the question's topic. Since the last three years, most AES systems developed made use of the concept of the neural network.

Regression based models

Mohler and Mihalcea ( 2009 ). proposed text-to-text semantic similarity to assign a score to the student essays. There are two text similarity measures like Knowledge-based measures, corpus-based measures. There eight knowledge-based tests with all eight models. They found the similarity. The shortest path similarity determines based on the length, which shortest path between two contexts. Leacock & Chodorow find the similarity based on the shortest path's length between two concepts using node-counting. The Lesk similarity finds the overlap between the corresponding definitions, and Wu & Palmer algorithm finds similarities based on the depth of two given concepts in the wordnet taxonomy. Resnik, Lin, Jiang&Conrath, Hirst& St-Onge find the similarity based on different parameters like the concept, probability, normalization factor, lexical chains. In corpus-based likeness, there LSA BNC, LSA Wikipedia, and ESA Wikipedia, latent semantic analysis is trained on Wikipedia and has excellent domain knowledge. Among all similarity scores, correlation scores LSA Wikipedia scoring accuracy is more. But these similarity measure algorithms are not using NLP concepts. These models are before 2010 and basic concept models to continue the research automated essay grading with updated algorithms on neural networks with content-based features.

Adamson et al. ( 2014 ) proposed an automatic essay grading system which is a statistical-based approach in this they retrieved features like POS, Character count, Word count, Sentence count, Miss spelled words, n-gram representation of words to prepare essay vector. They formed a matrix with these all vectors in that they applied LSA to give a score to each essay. It is a statistical approach that doesn’t consider the semantics of the essay. The accuracy they got when compared to the human rater score with the system is 0.532.

Cummins et al. ( 2016 ). Proposed Timed Aggregate Perceptron vector model to give ranking to all the essays, and later they converted the rank algorithm to predict the score of the essay. The model trained with features like Word unigrams, bigrams, POS, Essay length, grammatical relation, Max word length, sentence length. It is multi-task learning, gives ranking to the essays, and predicts the score for the essay. The performance evaluated through QWK is 0.69, a substantial agreement between the human rater and the system.

Sultan et al. ( 2016 ). Proposed a Ridge regression model to find short answer scoring with Question Demoting. Question Demoting is the new concept included in the essay's final assessment to eliminate duplicate words from the essay. The extracted features are Text Similarity, which is the similarity between the student response and reference answer. Question Demoting is the number of repeats in a student response. With inverse document frequency, they assigned term weight. The sentence length Ratio is the number of words in the student response, is another feature. With these features, the Ridge regression model was used, and the accuracy they got 0.887.

Contreras et al. ( 2018 ). Proposed Ontology based on text mining in this model has given a score for essays in phases. In phase-I, they generated ontologies with ontoGen and SVM to find the concept and similarity in the essay. In phase II from ontologies, they retrieved features like essay length, word counts, correctness, vocabulary, and types of word used, domain information. After retrieving statistical data, they used a linear regression model to find the score of the essay. The accuracy score is the average of 0.5.

Darwish and Mohamed ( 2020 ) proposed the fusion of fuzzy Ontology with LSA. They retrieve two types of features, like syntax features and semantic features. In syntax features, they found Lexical Analysis with tokens, and they construct a parse tree. If the parse tree is broken, the essay is inconsistent—a separate grade assigned to the essay concerning syntax features. The semantic features are like similarity analysis, Spatial Data Analysis. Similarity analysis is to find duplicate sentences—Spatial Data Analysis for finding Euclid distance between the center and part. Later they combine syntax features and morphological features score for the final score. The accuracy they achieved with the multiple linear regression model is 0.77, mostly on statistical features.

Süzen Neslihan et al. ( 2020 ) proposed a text mining approach for short answer grading. First, their comparing model answers with student response by calculating the distance between two sentences. By comparing the model answer with student response, they find the essay's completeness and provide feedback. In this approach, model vocabulary plays a vital role in grading, and with this model vocabulary, the grade will be assigned to the student's response and provides feedback. The correlation between the student answer to model answer is 0.81.

Classification based Models

Persing and Ng ( 2013 ) used a support vector machine to score the essay. The features extracted are OS, N-gram, and semantic text to train the model and identified the keywords from the essay to give the final score.

Sakaguchi et al. ( 2015 ) proposed two methods: response-based and reference-based. In response-based scoring, the extracted features are response length, n-gram model, and syntactic elements to train the support vector regression model. In reference-based scoring, features such as sentence similarity using word2vec is used to find the cosine similarity of the sentences that is the final score of the response. First, the scores were discovered individually and later combined two features to find a final score. This system gave a remarkable increase in performance by combining the scores.

Mathias and Bhattacharyya ( 2018a ; b ) Proposed Automated Essay Grading Dataset with Essay Attribute Scores. The first concept features selection depends on the essay type. So the common attributes are Content, Organization, Word Choice, Sentence Fluency, Conventions. In this system, each attribute is scored individually, with the strength of each attribute identified. The model they used is a random forest classifier to assign scores to individual attributes. The accuracy they got with QWK is 0.74 for prompt 1 of the ASAS dataset ( https://www.kaggle.com/c/asap-sas/ ).

Ke et al. ( 2019 ) used a support vector machine to find the response score. In this method, features like Agreeability, Specificity, Clarity, Relevance to prompt, Conciseness, Eloquence, Confidence, Direction of development, Justification of opinion, and Justification of importance. First, the individual parameter score obtained was later combined with all scores to give a final response score. The features are used in the neural network to find whether the sentence is relevant to the topic or not.

Salim et al. ( 2019 ) proposed an XGBoost Machine Learning classifier to assess the essays. The algorithm trained on features like word count, POS, parse tree depth, and coherence in the articles with sentence similarity percentage; cohesion and coherence are considered for training. And they implemented K-fold cross-validation for a result the average accuracy after specific validations is 68.12.

Neural network models

Shehab et al. ( 2016 ) proposed a neural network method that used learning vector quantization to train human scored essays. After training, the network can provide a score to the ungraded essays. First, we should process the essay to remove Spell checking and then perform preprocessing steps like Document Tokenization, stop word removal, Stemming, and submit it to the neural network. Finally, the model will provide feedback on the essay, whether it is relevant to the topic. And the correlation coefficient between human rater and system score is 0.7665.

Kopparapu and De ( 2016 ) proposed the Automatic Ranking of Essays using Structural and Semantic Features. This approach constructed a super essay with all the responses. Next, ranking for a student essay is done based on the super-essay. The structural and semantic features derived helps to obtain the scores. In a paragraph, 15 Structural features like an average number of sentences, the average length of sentences, and the count of words, nouns, verbs, adjectives, etc., are used to obtain a syntactic score. A similarity score is used as semantic features to calculate the overall score.

Dong and Zhang ( 2016 ) proposed a hierarchical CNN model. The model builds two layers with word embedding to represents the words as the first layer. The second layer is a word convolution layer with max-pooling to find word vectors. The next layer is a sentence-level convolution layer with max-pooling to find the sentence's content and synonyms. A fully connected dense layer produces an output score for an essay. The accuracy with the hierarchical CNN model resulted in an average QWK of 0.754.

Taghipour and Ng ( 2016 ) proposed a first neural approach for essay scoring build in which convolution and recurrent neural network concepts help in scoring an essay. The network uses a lookup table with the one-hot representation of the word vector of an essay. The final efficiency of the network model with LSTM resulted in an average QWK of 0.708.

Dong et al. ( 2017 ). Proposed an Attention-based scoring system with CNN + LSTM to score an essay. For CNN, the input parameters were character embedding and word embedding, and it has attention pooling layers and used NLTK to obtain word and character embedding. The output gives a sentence vector, which provides sentence weight. After CNN, it will have an LSTM layer with an attention pooling layer, and this final layer results in the final score of the responses. The average QWK score is 0.764.

Riordan et al. ( 2017 ) proposed a neural network with CNN and LSTM layers. Word embedding, given as input to a neural network. An LSTM network layer will retrieve the window features and delivers them to the aggregation layer. The aggregation layer is a superficial layer that takes a correct window of words and gives successive layers to predict the answer's sore. The accuracy of the neural network resulted in a QWK of 0.90.

Zhao et al. ( 2017 ) proposed a new concept called Memory-Augmented Neural network with four layers, input representation layer, memory addressing layer, memory reading layer, and output layer. An input layer represents all essays in a vector form based on essay length. After converting the word vector, the memory addressing layer takes a sample of the essay and weighs all the terms. The memory reading layer takes the input from memory addressing segment and finds the content to finalize the score. Finally, the output layer will provide the final score of the essay. The accuracy of essay scores is 0.78, which is far better than the LSTM neural network.

Mathias and Bhattacharyya ( 2018a ; b ) proposed deep learning networks using LSTM with the CNN layer and GloVe pre-trained word embeddings. For this, they retrieved features like Sentence count essays, word count per sentence, Number of OOVs in the sentence, Language model score, and the text's perplexity. The network predicted the goodness scores of each essay. The higher the goodness scores, means higher the rank and vice versa.

Nguyen and Dery ( 2016 ). Proposed Neural Networks for Automated Essay Grading. In this method, a single layer bi-directional LSTM accepting word vector as input. Glove vectors used in this method resulted in an accuracy of 90%.

Ruseti et al. ( 2018 ) proposed a recurrent neural network that is capable of memorizing the text and generate a summary of an essay. The Bi-GRU network with the max-pooling layer molded on the word embedding of each document. It will provide scoring to the essay by comparing it with a summary of the essay from another Bi-GRU network. The result obtained an accuracy of 0.55.

Wang et al. ( 2018a ; b ) proposed an automatic scoring system with the bi-LSTM recurrent neural network model and retrieved the features using the word2vec technique. This method generated word embeddings from the essay words using the skip-gram model. And later, word embedding is used to train the neural network to find the final score. The softmax layer in LSTM obtains the importance of each word. This method used a QWK score of 0.83%.

Dasgupta et al. ( 2018 ) proposed a technique for essay scoring with augmenting textual qualitative Features. It extracted three types of linguistic, cognitive, and psychological features associated with a text document. The linguistic features are Part of Speech (POS), Universal Dependency relations, Structural Well-formedness, Lexical Diversity, Sentence Cohesion, Causality, and Informativeness of the text. The psychological features derived from the Linguistic Information and Word Count (LIWC) tool. They implemented a convolution recurrent neural network that takes input as word embedding and sentence vector, retrieved from the GloVe word vector. And the second layer is the Convolution Layer to find local features. The next layer is the recurrent neural network (LSTM) to find corresponding of the text. The accuracy of this method resulted in an average QWK of 0.764.

Liang et al. ( 2018 ) proposed a symmetrical neural network AES model with Bi-LSTM. They are extracting features from sample essays and student essays and preparing an embedding layer as input. The embedding layer output is transfer to the convolution layer from that LSTM will be trained. Hear the LSRM model has self-features extraction layer, which will find the essay's coherence. The average QWK score of SBLSTMA is 0.801.

Liu et al. ( 2019 ) proposed two-stage learning. In the first stage, they are assigning a score based on semantic data from the essay. The second stage scoring is based on some handcrafted features like grammar correction, essay length, number of sentences, etc. The average score of the two stages is 0.709.

Pedro Uria Rodriguez et al. ( 2019 ) proposed a sequence-to-sequence learning model for automatic essay scoring. They used BERT (Bidirectional Encoder Representations from Transformers), which extracts the semantics from a sentence from both directions. And XLnet sequence to sequence learning model to extract features like the next sentence in an essay. With this pre-trained model, they attained coherence from the essay to give the final score. The average QWK score of the model is 75.5.

Xia et al. ( 2019 ) proposed a two-layer Bi-directional LSTM neural network for the scoring of essays. The features extracted with word2vec to train the LSTM and accuracy of the model in an average of QWK is 0.870.

Kumar et al. ( 2019 ) Proposed an AutoSAS for short answer scoring. It used pre-trained Word2Vec and Doc2Vec models trained on Google News corpus and Wikipedia dump, respectively, to retrieve the features. First, they tagged every word POS and they found weighted words from the response. It also found prompt overlap to observe how the answer is relevant to the topic, and they defined lexical overlaps like noun overlap, argument overlap, and content overlap. This method used some statistical features like word frequency, difficulty, diversity, number of unique words in each response, type-token ratio, statistics of the sentence, word length, and logical operator-based features. This method uses a random forest model to train the dataset. The data set has sample responses with their associated score. The model will retrieve the features from both responses like graded and ungraded short answers with questions. The accuracy of AutoSAS with QWK is 0.78. It will work on any topics like Science, Arts, Biology, and English.

Jiaqi Lun et al. ( 2020 ) proposed an automatic short answer scoring with BERT. In this with a reference answer comparing student responses and assigning scores. The data augmentation is done with a neural network and with one correct answer from the dataset classifying reaming responses as correct or incorrect.

Zhu and Sun ( 2020 ) proposed a multimodal Machine Learning approach for automated essay scoring. First, they count the grammar score with the spaCy library and numerical count as the number of words and sentences with the same library. With this input, they trained a single and Bi LSTM neural network for finding the final score. For the LSTM model, they prepared sentence vectors with GloVe and word embedding with NLTK. Bi-LSTM will check each sentence in both directions to find semantic from the essay. The average QWK score with multiple models is 0.70.

Ontology based approach

Mohler et al. ( 2011 ) proposed a graph-based method to find semantic similarity in short answer scoring. For the ranking of answers, they used the support vector regression model. The bag of words is the main feature extracted in the system.

Ramachandran et al. ( 2015 ) also proposed a graph-based approach to find lexical based semantics. Identified phrase patterns and text patterns are the features to train a random forest regression model to score the essays. The accuracy of the model in a QWK is 0.78.

Zupanc et al. ( 2017 ) proposed sentence similarity networks to find the essay's score. Ajetunmobi and Daramola ( 2017 ) recommended an ontology-based information extraction approach and domain-based ontology to find the score.

Speech response scoring

Automatic scoring is in two ways one is text-based scoring, other is speech-based scoring. This paper discussed text-based scoring and its challenges, and now we cover speech scoring and common points between text and speech-based scoring. Evanini and Wang ( 2013 ), Worked on speech scoring of non-native school students, extracted features with speech ratter, and trained a linear regression model, concluding that accuracy varies based on voice pitching. Loukina et al. ( 2015 ) worked on feature selection from speech data and trained SVM. Malinin et al. ( 2016 ) used neural network models to train the data. Loukina et al. ( 2017 ). Proposed speech and text-based automatic scoring. Extracted text-based features, speech-based features and trained a deep neural network for speech-based scoring. They extracted 33 types of features based on acoustic signals. Malinin et al. ( 2017 ). Wu Xixin et al. ( 2020 ) Worked on deep neural networks for spoken language assessment. Incorporated different types of models and tested them. Ramanarayanan et al. ( 2017 ) worked on feature extraction methods and extracted punctuation, fluency, and stress and trained different Machine Learning models for scoring. Knill et al. ( 2018 ). Worked on Automatic speech recognizer and its errors how its impacts the speech assessment.

The state of the art

This section provides an overview of the existing AES systems with a comparative study w. r. t models, features applied, datasets, and evaluation metrics used for building the automated essay grading systems. We divided all 62 papers into two sets of the first set of review papers in Table ​ Table5 5 with a comparative study of the AES systems.

State of the art

Comparison of all approaches

In our study, we divided major AES approaches into three categories. Regression models, classification models, and neural network models. The regression models failed to find cohesion and coherence from the essay because it trained on BoW(Bag of Words) features. In processing data from input to output, the regression models are less complicated than neural networks. There are unable to find many intricate patterns from the essay and unable to find sentence connectivity. If we train the model with BoW features in the neural network approach, the model never considers the essay's coherence and coherence.

First, to train a Machine Learning algorithm with essays, all the essays are converted to vector form. We can form a vector with BoW and Word2vec, TF-IDF. The BoW and Word2vec vector representation of essays represented in Table ​ Table6. 6 . The vector representation of BoW with TF-IDF is not incorporating the essays semantic, and it’s just statistical learning from a given vector. Word2vec vector comprises semantic of essay in a unidirectional way.

Vector representation of essays

In BoW, the vector contains the frequency of word occurrences in the essay. The vector represents 1 and more based on the happenings of words in the essay and 0 for not present. So, in BoW, the vector does not maintain the relationship with adjacent words; it’s just for single words. In word2vec, the vector represents the relationship between words with other words and sentences prompt in multiple dimensional ways. But word2vec prepares vectors in a unidirectional way, not in a bidirectional way; word2vec fails to find semantic vectors when a word has two meanings, and the meaning depends on adjacent words. Table ​ Table7 7 represents a comparison of Machine Learning models and features extracting methods.

Comparison of models

In AES, cohesion and coherence will check the content of the essay concerning the essay prompt these can be extracted from essay in the vector from. Two more parameters are there to access an essay is completeness and feedback. Completeness will check whether student’s response is sufficient or not though the student wrote correctly. Table ​ Table8 8 represents all four parameters comparison for essay grading. Table ​ Table9 9 illustrates comparison of all approaches based on various features like grammar, spelling, organization of essay, relevance.

Comparison of all models with respect to cohesion, coherence, completeness, feedback

comparison of all approaches on various features

What are the challenges/limitations in the current research?

From our study and results discussed in the previous sections, many researchers worked on automated essay scoring systems with numerous techniques. We have statistical methods, classification methods, and neural network approaches to evaluate the essay automatically. The main goal of the automated essay grading system is to reduce human effort and improve consistency.

The vast majority of essay scoring systems are dealing with the efficiency of the algorithm. But there are many challenges in automated essay grading systems. One should assess the essay by following parameters like the relevance of the content to the prompt, development of ideas, Cohesion, Coherence, and domain knowledge.

No model works on the relevance of content, which means whether student response or explanation is relevant to the given prompt or not if it is relevant to how much it is appropriate, and there is no discussion about the cohesion and coherence of the essays. All researches concentrated on extracting the features using some NLP libraries, trained their models, and testing the results. But there is no explanation in the essay evaluation system about consistency and completeness, But Palma and Atkinson ( 2018 ) explained coherence-based essay evaluation. And Zupanc and Bosnic ( 2014 ) also used the word coherence to evaluate essays. And they found consistency with latent semantic analysis (LSA) for finding coherence from essays, but the dictionary meaning of coherence is "The quality of being logical and consistent."

Another limitation is there is no domain knowledge-based evaluation of essays using Machine Learning models. For example, the meaning of a cell is different from biology to physics. Many Machine Learning models extract features with WordVec and GloVec; these NLP libraries cannot convert the words into vectors when they have two or more meanings.

Other challenges that influence the Automated Essay Scoring Systems.

All these approaches worked to improve the QWK score of their models. But QWK will not assess the model in terms of features extraction and constructed irrelevant answers. The QWK is not evaluating models whether the model is correctly assessing the answer or not. There are many challenges concerning students' responses to the Automatic scoring system. Like in evaluating approach, no model has examined how to evaluate the constructed irrelevant and adversarial answers. Especially the black box type of approaches like deep learning models provides more options to the students to bluff the automated scoring systems.

The Machine Learning models that work on statistical features are very vulnerable. Based on Powers et al. ( 2001 ) and Bejar Isaac et al. ( 2014 ), the E-rater was failed on Constructed Irrelevant Responses Strategy (CIRS). From the study of Bejar et al. ( 2013 ), Higgins and Heilman ( 2014 ), observed that when student response contain irrelevant content or shell language concurring to prompt will influence the final score of essays in an automated scoring system.

In deep learning approaches, most of the models automatically read the essay's features, and some methods work on word-based embedding and other character-based embedding features. From the study of Riordan Brain et al. ( 2019 ), The character-based embedding systems do not prioritize spelling correction. However, it is influencing the final score of the essay. From the study of Horbach and Zesch ( 2019 ), Various factors are influencing AES systems. For example, there are data set size, prompt type, answer length, training set, and human scorers for content-based scoring.

Ding et al. ( 2020 ) reviewed that the automated scoring system is vulnerable when a student response contains more words from prompt, like prompt vocabulary repeated in the response. Parekh et al. ( 2020 ) and Kumar et al. ( 2020 ) tested various neural network models of AES by iteratively adding important words, deleting unimportant words, shuffle the words, and repeating sentences in an essay and found that no change in the final score of essays. These neural network models failed to recognize common sense in adversaries' essays and give more options for the students to bluff the automated systems.

Other than NLP and ML techniques for AES. From Wresch ( 1993 ) to Madnani and Cahill ( 2018 ). discussed the complexity of AES systems, standards need to be followed. Like assessment rubrics to test subject knowledge, irrelevant responses, and ethical aspects of an algorithm like measuring the fairness of student response.

Fairness is an essential factor for automated systems. For example, in AES, fairness can be measure in an agreement between human score to machine score. Besides this, From Loukina et al. ( 2019 ), the fairness standards include overall score accuracy, overall score differences, and condition score differences between human and system scores. In addition, scoring different responses in the prospect of constructive relevant and irrelevant will improve fairness.

Madnani et al. ( 2017a ; b ). Discussed the fairness of AES systems for constructed responses and presented RMS open-source tool for detecting biases in the models. With this, one can change fairness standards according to their analysis of fairness.

From Berzak et al.'s ( 2018 ) approach, behavior factors are a significant challenge in automated scoring systems. That helps to find language proficiency, word characteristics (essential words from the text), predict the critical patterns from the text, find related sentences in an essay, and give a more accurate score.

Rupp ( 2018 ), has discussed the designing, evaluating, and deployment methodologies for AES systems. They provided notable characteristics of AES systems for deployment. They are like model performance, evaluation metrics for a model, threshold values, dynamically updated models, and framework.

First, we should check the model performance on different datasets and parameters for operational deployment. Selecting Evaluation metrics for AES models are like QWK, correlation coefficient, or sometimes both. Kelley and Preacher ( 2012 ) have discussed three categories of threshold values: marginal, borderline, and acceptable. The values can be varied based on data size, model performance, type of model (single scoring, multiple scoring models). Once a model is deployed and evaluates millions of responses every time for optimal responses, we need a dynamically updated model based on prompt and data. Finally, framework designing of AES model, hear a framework contains prompts where test-takers can write the responses. One can design two frameworks: a single scoring model for a single methodology and multiple scoring models for multiple concepts. When we deploy multiple scoring models, each prompt could be trained separately, or we can provide generalized models for all prompts with this accuracy may vary, and it is challenging.

Our Systematic literature review on the automated essay grading system first collected 542 papers with selected keywords from various databases. After inclusion and exclusion criteria, we left with 139 articles; on these selected papers, we applied Quality assessment criteria with two reviewers, and finally, we selected 62 writings for final review.

Our observations on automated essay grading systems from 2010 to 2020 are as followed:

  • The implementation techniques of automated essay grading systems are classified into four buckets; there are 1. regression models 2. Classification models 3. Neural networks 4. Ontology-based methodology, but using neural networks, the researchers are more accurate than other techniques, and all the methods state of the art provided in Table ​ Table3 3 .
  • The majority of the regression and classification models on essay scoring used statistical features to find the final score. It means the systems or models trained on such parameters as word count, sentence count, etc. though the parameters extracted from the essay, the algorithm are not directly training on essays. The algorithms trained on some numbers obtained from the essay and hear if numbers matched the composition will get a good score; otherwise, the rating is less. In these models, the evaluation process is entirely on numbers, irrespective of the essay. So, there is a lot of chance to miss the coherence, relevance of the essay if we train our algorithm on statistical parameters.
  • In the neural network approach, the models trained on Bag of Words (BoW) features. The BoW feature is missing the relationship between a word to word and the semantic meaning of the sentence. E.g., Sentence 1: John killed bob. Sentence 2: bob killed John. In these two sentences, the BoW is "John," "killed," "bob."
  • In the Word2Vec library, if we are prepared a word vector from an essay in a unidirectional way, the vector will have a dependency with other words and finds the semantic relationship with other words. But if a word has two or more meanings like "Bank loan" and "River Bank," hear bank has two implications, and its adjacent words decide the sentence meaning; in this case, Word2Vec is not finding the real meaning of the word from the sentence.
  • The features extracted from essays in the essay scoring system are classified into 3 type's features like statistical features, style-based features, and content-based features, which are explained in RQ2 and Table ​ Table3. 3 . But statistical features, are playing a significant role in some systems and negligible in some systems. In Shehab et al. ( 2016 ); Cummins et al. ( 2016 ). Dong et al. ( 2017 ). Dong and Zhang ( 2016 ). Mathias and Bhattacharyya ( 2018a ; b ) Systems the assessment is entirely on statistical and style-based features they have not retrieved any content-based features. And in other systems that extract content from the essays, the role of statistical features is for only preprocessing essays but not included in the final grading.
  • In AES systems, coherence is the main feature to be considered while evaluating essays. The actual meaning of coherence is to stick together. That is the logical connection of sentences (local level coherence) and paragraphs (global level coherence) in a story. Without coherence, all sentences in a paragraph are independent and meaningless. In an Essay, coherence is a significant feature that is explaining everything in a flow and its meaning. It is a powerful feature in AES system to find the semantics of essay. With coherence, one can assess whether all sentences are connected in a flow and all paragraphs are related to justify the prompt. Retrieving the coherence level from an essay is a critical task for all researchers in AES systems.
  • In automatic essay grading systems, the assessment of essays concerning content is critical. That will give the actual score for the student. Most of the researches used statistical features like sentence length, word count, number of sentences, etc. But according to collected results, 32% of the systems used content-based features for the essay scoring. Example papers which are on content-based assessment are Taghipour and Ng ( 2016 ); Persing and Ng ( 2013 ); Wang et al. ( 2018a , 2018b ); Zhao et al. ( 2017 ); Kopparapu and De ( 2016 ), Kumar et al. ( 2019 ); Mathias and Bhattacharyya ( 2018a ; b ); Mohler and Mihalcea ( 2009 ) are used content and statistical-based features. The results are shown in Fig. ​ Fig.3. 3 . And mainly the content-based features extracted with word2vec NLP library, but word2vec is capable of capturing the context of a word in a document, semantic and syntactic similarity, relation with other terms, but word2vec is capable of capturing the context word in a uni-direction either left or right. If a word has multiple meanings, there is a chance of missing the context in the essay. After analyzing all the papers, we found that content-based assessment is a qualitative assessment of essays.
  • On the other hand, Horbach and Zesch ( 2019 ); Riordan Brain et al. ( 2019 ); Ding et al. ( 2020 ); Kumar et al. ( 2020 ) proved that neural network models are vulnerable when a student response contains constructed irrelevant, adversarial answers. And a student can easily bluff an automated scoring system by submitting different responses like repeating sentences and repeating prompt words in an essay. From Loukina et al. ( 2019 ), and Madnani et al. ( 2017b ). The fairness of an algorithm is an essential factor to be considered in AES systems.
  • While talking about speech assessment, the data set contains audios of duration up to one minute. Feature extraction techniques are entirely different from text assessment, and accuracy varies based on speaking fluency, pitching, male to female voice and boy to adult voice. But the training algorithms are the same for text and speech assessment.
  • Once an AES system evaluates essays and short answers accurately in all directions, there is a massive demand for automated systems in the educational and related world. Now AES systems are deployed in GRE, TOEFL exams; other than these, we can deploy AES systems in massive open online courses like Coursera(“ https://coursera.org/learn//machine-learning//exam ”), NPTEL ( https://swayam.gov.in/explorer ), etc. still they are assessing student performance with multiple-choice questions. In another perspective, AES systems can be deployed in information retrieval systems like Quora, stack overflow, etc., to check whether the retrieved response is appropriate to the question or not and can give ranking to the retrieved answers.

Conclusion and future work

As per our Systematic literature review, we studied 62 papers. There exist significant challenges for researchers in implementing automated essay grading systems. Several researchers are working rigorously on building a robust AES system despite its difficulty in solving this problem. All evaluating methods are not evaluated based on coherence, relevance, completeness, feedback, and knowledge-based. And 90% of essay grading systems are used Kaggle ASAP (2012) dataset, which has general essays from students and not required any domain knowledge, so there is a need for domain-specific essay datasets to train and test. Feature extraction is with NLTK, WordVec, and GloVec NLP libraries; these libraries have many limitations while converting a sentence into vector form. Apart from feature extraction and training Machine Learning models, no system is accessing the essay's completeness. No system provides feedback to the student response and not retrieving coherence vectors from the essay—another perspective the constructive irrelevant and adversarial student responses still questioning AES systems.

Our proposed research work will go on the content-based assessment of essays with domain knowledge and find a score for the essays with internal and external consistency. And we will create a new dataset concerning one domain. And another area in which we can improve is the feature extraction techniques.

This study includes only four digital databases for study selection may miss some functional studies on the topic. However, we hope that we covered most of the significant studies as we manually collected some papers published in useful journals.

Below is the link to the electronic supplementary material.

Not Applicable.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Dadi Ramesh, Email: moc.liamg@44hsemaridad .

Suresh Kumar Sanampudi, Email: ni.ca.hutnj@idupmanashserus .

  • Adamson, A., Lamb, A., & December, R. M. (2014). Automated Essay Grading.
  • Ajay HB, Tillett PI, Page EB (1973) Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development
  • Ajetunmobi SA, Daramola O (2017) Ontology-based information extraction for subject-focussed automatic essay evaluation. In: 2017 International Conference on Computing Networking and Informatics (ICCNI) p 1–6. IEEE
  • Alva-Manchego F, et al. (2019) EASSE: Easier Automatic Sentence Simplification Evaluation.” ArXiv abs/1908.04567 (2019): n. pag
  • Bailey S, Meurers D (2008) Diagnosing meaning errors in short answers to reading comprehension questions. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (Columbus), p 107–115
  • Basu S, Jacobs C, Vanderwende L. Powergrading: a clustering approach to amplify human effort for short answer grading. Trans Assoc Comput Linguist (TACL) 2013; 1 :391–402. doi: 10.1162/tacl_a_00236. [ CrossRef ] [ Google Scholar ]
  • Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing, 22, 48-59.
  • Bejar I, et al. (2013) Length of Textual Response as a Construct-Irrelevant Response Strategy: The Case of Shell Language. Research Report. ETS RR-13-07.” ETS Research Report Series (2013): n. pag
  • Berzak Y, et al. (2018) “Assessing Language Proficiency from Eye Movements in Reading.” ArXiv abs/1804.07329 (2018): n. pag
  • Blanchard D, Tetreault J, Higgins D, Cahill A, Chodorow M (2013) TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15, 2013
  • Blood, I. (2011). Automated essay scoring: a literature review. Studies in Applied Linguistics and TESOL, 11(2).
  • Burrows S, Gurevych I, Stein B. The eras and trends of automatic short answer grading. Int J Artif Intell Educ. 2015; 25 :60–117. doi: 10.1007/s40593-014-0026-8. [ CrossRef ] [ Google Scholar ]
  • Cader, A. (2020, July). The Potential for the Use of Deep Neural Networks in e-Learning Student Evaluation with New Data Augmentation Method. In International Conference on Artificial Intelligence in Education (pp. 37–42). Springer, Cham.
  • Cai C (2019) Automatic essay scoring with recurrent neural network. In: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications (2019): n. pag.
  • Chen M, Li X (2018) "Relevance-Based Automated Essay Scoring via Hierarchical Recurrent Model. In: 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 2018, p 378–383, doi: 10.1109/IALP.2018.8629256
  • Chen Z, Zhou Y (2019) "Research on Automatic Essay Scoring of Composition Based on CNN and OR. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, p 13–18, doi: 10.1109/ICAIBD.2019.8837007
  • Contreras JO, Hilles SM, Abubakar ZB (2018) Automated essay scoring with ontology based on text mining and NLTK tools. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 1-6
  • Correnti R, Matsumura LC, Hamilton L, Wang E. Assessing students’ skills at writing analytically in response to texts. Elem Sch J. 2013; 114 (2):142–177. doi: 10.1086/671936. [ CrossRef ] [ Google Scholar ]
  • Cummins, R., Zhang, M., & Briscoe, E. (2016, August). Constrained multi-task learning for automated essay scoring. Association for Computational Linguistics.
  • Darwish SM, Mohamed SK (2020) Automated essay evaluation based on fusion of fuzzy ontology and latent semantic analysis. In: Hassanien A, Azar A, Gaber T, Bhatnagar RF, Tolba M (eds) The International Conference on Advanced Machine Learning Technologies and Applications
  • Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 93–102
  • Ding Y, et al. (2020) "Don’t take “nswvtnvakgxpm” for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input." In: Proceedings of the 28th International Conference on Computational Linguistics
  • Dong F, Zhang Y (2016) Automatic features for essay scoring–an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1072–1077
  • Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) p 153–162
  • Dzikovska M, Nielsen R, Brew C, Leacock C, Gi ampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013a) Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge
  • Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Trang Dang H (2013b) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. *SEM 2013: The First Joint Conference on Lexical and Computational Semantics
  • Educational Testing Service (2008) CriterionSM online writing evaluation service. Retrieved from http://www.ets.org/s/criterion/pdf/9286_CriterionBrochure.pdf .
  • Evanini, K., & Wang, X. (2013, August). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).
  • Foltz PW, Laham D, Landauer TK (1999) The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1, 2, http://imej.wfu.edu/articles/1999/2/04/ index.asp
  • Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (Eds.). (2009). International corpus of learner English. Louvain-la-Neuve: Presses universitaires de Louvain.
  • Higgins D, Heilman M. Managing what we can measure: quantifying the susceptibility of automated scoring systems to gaming behavior” Educ Meas Issues Pract. 2014; 33 :36–46. doi: 10.1111/emip.12036. [ CrossRef ] [ Google Scholar ]
  • Horbach A, Zesch T. The influence of variance in learner answers on automatic content scoring. Front Educ. 2019; 4 :28. doi: 10.3389/feduc.2019.00028. [ CrossRef ] [ Google Scholar ]
  • https://www.coursera.org/learn/machine-learning/exam/7pytE/linear-regression-with-multiple-variables/attempt
  • Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208. [ PMC free article ] [ PubMed ]
  • Ke Z, Ng V (2019) “Automated essay scoring: a survey of the state of the art.” IJCAI
  • Ke, Z., Inamdar, H., Lin, H., & Ng, V. (2019, July). Give me more feedback II: Annotating thesis strength and related attributes in student essays. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3994-4004).
  • Kelley K, Preacher KJ. On effect size. Psychol Methods. 2012; 17 (2):137–152. doi: 10.1037/a0028086. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering–a systematic literature review. Inf Softw Technol. 2009; 51 (1):7–15. doi: 10.1016/j.infsof.2008.09.009. [ CrossRef ] [ Google Scholar ]
  • Klebanov, B. B., & Madnani, N. (2020, July). Automated evaluation of writing–50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7796–7810).
  • Knill K, Gales M, Kyriakopoulos K, et al. (4 more authors) (2018) Impact of ASR performance on free speaking language assessment. In: Interspeech 2018.02–06 Sep 2018, Hyderabad, India. International Speech Communication Association (ISCA)
  • Kopparapu SK, De A (2016) Automatic ranking of essays using structural and semantic features. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), p 519–523
  • Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019, July). Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 9662–9669).
  • Kumar Y, et al. (2020) “Calling out bluff: attacking the robustness of automatic scoring systems with simple adversarial testing.” ArXiv abs/2007.06796
  • Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-Based Automated Essay Scoring Using Self-attention. In: Sun M, Liu T, Wang X, Liu Z, Liu Y (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2018, NLP-NABD 2018. Lecture Notes in Computer Science, vol 11221. Springer, Cham. 10.1007/978-3-030-01716-3_32
  • Liang G, On B, Jeong D, Kim H, Choi G. Automated essay scoring: a siamese bidirectional LSTM neural network architecture. Symmetry. 2018; 10 :682. doi: 10.3390/sym10120682. [ CrossRef ] [ Google Scholar ]
  • Liua, H., Yeb, Y., & Wu, M. (2018, April). Ensemble Learning on Scoring Student Essay. In 2018 International Conference on Management and Education, Humanities and Social Sciences (MEHSS 2018). Atlantis Press.
  • Liu J, Xu Y, Zhao L (2019) Automated Essay Scoring based on Two-Stage Learning. ArXiv, abs/1901.07744
  • Loukina A, et al. (2015) Feature selection for automated speech scoring.” BEA@NAACL-HLT
  • Loukina A, et al. (2017) “Speech- and Text-driven Features for Automated Scoring of English-Speaking Tasks.” SCNLP@EMNLP 2017
  • Loukina A, et al. (2019) The many dimensions of algorithmic fairness in educational applications. BEA@ACL
  • Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(09): 13389-13396
  • Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).
  • Madnani N, et al. (2017b) “Building better open-source tools to support fairness in automated scoring.” EthNLP@EACL
  • Malinin A, et al. (2016) “Off-topic response detection for spontaneous spoken english assessment.” ACL
  • Malinin A, et al. (2017) “Incorporating uncertainty into deep learning for spoken language assessment.” ACL
  • Mathias S, Bhattacharyya P (2018a) Thank “Goodness”! A Way to Measure Style in Student Essays. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 35–41
  • Mathias S, Bhattacharyya P (2018b) ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • Mikolov T, et al. (2013) “Efficient Estimation of Word Representations in Vector Space.” ICLR
  • Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) p 567–575
  • Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies p 752–762
  • Muangkammuen P, Fukumoto F (2020) Multi-task Learning for Automated Essay Scoring with Sentiment Analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop p 116–123
  • Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1–11.
  • Palma D, Atkinson J. Coherence-based automatic essay assessment. IEEE Intell Syst. 2018; 33 (5):26–36. doi: 10.1109/MIS.2018.2877278. [ CrossRef ] [ Google Scholar ]
  • Parekh S, et al (2020) My Teacher Thinks the World Is Flat! Interpreting Automatic Essay Scoring Mechanism.” ArXiv abs/2012.13872 (2020): n. pag
  • Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
  • Persing I, Ng V (2013) Modeling thesis clarity in student essays. In:Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) p 260–269
  • Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K. Stumping E-Rater: challenging the validity of automated essay scoring. ETS Res Rep Ser. 2001; 2001 (1):i–44. [ Google Scholar ]
  • Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K. Stumping e-rater: challenging the validity of automated essay scoring. Comput Hum Behav. 2002; 18 (2):103–134. doi: 10.1016/S0747-5632(01)00052-8. [ CrossRef ] [ Google Scholar ]
  • Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications p 97–106
  • Ramanarayanan V, et al. (2017) “Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human-Machine Spoken Dialog Interactions.” INTERSPEECH
  • Riordan B, Horbach A, Cahill A, Zesch T, Lee C (2017) Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications p 159–168
  • Riordan B, Flor M, Pugh R (2019) "How to account for misspellings: Quantifying the benefit of character representations in neural content scoring models."In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
  • Rodriguez P, Jafari A, Ormerod CM (2019) Language models and Automated Essay Scoring. ArXiv, abs/1909.09482
  • Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2).
  • Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).
  • Rupp A. Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl Meas Educ. 2018; 31 :191–214. doi: 10.1080/08957347.2018.1464448. [ CrossRef ] [ Google Scholar ]
  • Ruseti S, Dascalu M, Johnson AM, McNamara DS, Balyan R, McCarthy KS, Trausan-Matu S (2018) Scoring summaries using recurrent neural networks. In: International Conference on Intelligent Tutoring Systems p 191–201. Springer, Cham
  • Sakaguchi K, Heilman M, Madnani N (2015) Effective feature integration for automated short answer scoring. In: Proceedings of the 2015 conference of the North American Chapter of the association for computational linguistics: Human language technologies p 1049–1054
  • Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., & Suhartono, D. (2019, December). Automated English Digital Essay Grader Using Machine Learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) (pp. 1–6). IEEE.
  • Shehab A, Elhoseny M, Hassanien AE (2016) A hybrid scheme for Automated Essay Grading based on LVQ and NLP techniques. In: 12th International Computer Engineering Conference (ICENCO), Cairo, 2016, p 65-70
  • Shermis MD, Mzumara HR, Olson J, Harrington S. On-line grading of student essays: PEG goes on the World Wide Web. Assess Eval High Educ. 2001; 26 (3):247–259. doi: 10.1080/02602930120052404. [ CrossRef ] [ Google Scholar ]
  • Stab C, Gurevych I (2014) Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) p 46–56
  • Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1070–1075
  • Süzen, N., Gorban, A. N., Levesley, J., & Mirkes, E. M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743.
  • Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the 2016 conference on empirical methods in natural language processing p 1882–1891
  • Tashu TM (2020) "Off-Topic Essay Detection Using C-BGRU Siamese. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, p 221–225, doi: 10.1109/ICSC.2020.00046
  • Tashu TM, Horváth T (2019) A layered approach to automatic essay evaluation using word-embedding. In: McLaren B, Reilly R, Zvacek S, Uhomoibhi J (eds) Computer Supported Education. CSEDU 2018. Communications in Computer and Information Science, vol 1022. Springer, Cham
  • Tashu TM, Horváth T (2020) Semantic-Based Feedback Recommendation for Automatic Essay Evaluation. In: Bi Y, Bhatia R, Kapoor S (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham
  • Uto M, Okano M (2020) Robust Neural Automated Essay Scoring Using Item Response Theory. In: Bittencourt I, Cukurova M, Muldner K, Luckin R, Millán E (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham
  • Wang Z, Liu J, Dong R (2018a) Intelligent Auto-grading System. In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) p 430–435. IEEE.
  • Wang Y, et al. (2018b) “Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning.” EMNLP
  • Zhu W, Sun Y (2020) Automated essay scoring system using multi-model Machine Learning, david c. wyld et al. (eds): mlnlp, bdiot, itccma, csity, dtmn, aifz, sigpro
  • Wresch W. The Imminence of Grading Essays by Computer-25 Years Later. Comput Compos. 1993; 10 :45–58. doi: 10.1016/S8755-4615(05)80058-1. [ CrossRef ] [ Google Scholar ]
  • Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment.
  • Xia L, Liu J, Zhang Z (2019) Automatic Essay Scoring Model Based on Two-Layer Bi-directional Long-Short Term Memory Network. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence p 133–137
  • Yannakoudakis H, Briscoe T, Medlock B (2011) A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies p 180–189
  • Zhao S, Zhang Y, Xiong X, Botelho A, Heffernan N (2017) A memory-augmented neural model for automated grading. In: Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale p 189–192
  • Zupanc K, Bosnic Z (2014) Automated essay evaluation augmented with semantic coherence measures. In: 2014 IEEE International Conference on Data Mining p 1133–1138. IEEE.
  • Zupanc K, Savić M, Bosnić Z, Ivanović M (2017) Evaluating coherence of essays using sentence-similarity networks. In: Proceedings of the 18th International Conference on Computer Systems and Technologies p 65–72
  • Dzikovska, M. O., Nielsen, R., & Brew, C. (2012, June). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In  Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  (pp. 200-210).
  • Kumar, N., & Dey, L. (2013, November). Automatic Quality Assessment of documents with application to essay grading. In 2013 12th Mexican International Conference on Artificial Intelligence (pp. 216–222). IEEE.
  • Wu, S. H., & Shih, W. F. (2018, July). A short answer grading system in chinese by support vector approach. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications (pp. 125-129).
  • Agung Putri Ratna, A., Lalita Luhurkinanti, D., Ibrahim I., Husna D., Dewi Purnamasari P. (2018). Automatic Essay Grading System for Japanese Language Examination Using Winnowing Algorithm, 2018 International Seminar on Application for Technology of Information and Communication, 2018, pp. 565–569. 10.1109/ISEMANTIC.2018.8549789.
  • Sharma A., & Jayagopi D. B. (2018). Automated Grading of Handwritten Essays 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp 279–284. 10.1109/ICFHR-2018.2018.00056

Subscribe to the PwC Newsletter

Join the community, edit dataset, edit dataset tasks.

Some tasks are inferred based on the benchmarks list.

Add a Data Loader

Remove a data loader, edit dataset modalities, edit dataset languages, edit dataset variants.

The benchmarks section lists all benchmarks using a given dataset or any of its variants. We use variants to distinguish between results evaluated on slightly different versions of the same dataset. For example, ImageNet 32⨉32 and ImageNet 64⨉64 are variants of the ImageNet dataset.

Add a new evaluation result row

Student essay.

student essay dataset

Student Essay is widely used in research on argument segmentation

Benchmarks Edit Add a new result Link an existing benchmark

Dataset loaders edit add remove, similar datasets, debatepedia, sentence-level argument annotation, license edit, modalities edit, languages edit.

A Bi-modal Automated Essay Scoring System for Handwritten Essays

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

Hidding the Ghostwriters: An Adversarial Evaluation of AI -Generated Student Essay Detection

Xinlin Peng , Ying Zhou , Ben He , Le Sun , Yingfei Sun

Export citation

  • Preformatted

Markdown (Informal)

[Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection](https://aclanthology.org/2023.emnlp-main.644) (Peng et al., EMNLP 2023)

  • Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection (Peng et al., EMNLP 2023)
  • Xinlin Peng, Ying Zhou, Ben He, Le Sun, and Yingfei Sun. 2023. Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 10406–10419, Singapore. Association for Computational Linguistics.

Announcing the NeurIPS 2023 Paper Awards 

Communications Chairs 2023 2023 Conference awards , neurips2023

By Amir Globerson, Kate Saenko, Moritz Hardt, Sergey Levine and Comms Chair, Sahra Ghalebikesabi 

We are honored to announce the award-winning papers for NeurIPS 2023! This year’s prestigious awards consist of the Test of Time Award plus two Outstanding Paper Awards in each of these three categories: 

  • Two Outstanding Main Track Papers 
  • Two Outstanding Main Track Runner-Ups 
  • Two Outstanding Datasets and Benchmark Track Papers  

This year’s organizers received a record number of paper submissions. Of the 13,300 submitted papers that were reviewed by 968 Area Chairs, 98 senior area chairs, and 396 Ethics reviewers 3,540  were accepted after 502 papers were flagged for ethics reviews . 

We thank the awards committee for the main track: Yoav Artzi, Chelsea Finn, Ludwig Schmidt, Ricardo Silva, Isabel Valera, and Mengdi Wang. For the Datasets and Benchmarks track, we thank Sergio Escalera, Isabelle Guyon, Neil Lawrence, Dina Machuve, Olga Russakovsky, Hugo Jair Escalante, Deepti Ghadiyaram, and Serena Yeung. Conflicts of interest were taken into account in the decision process.

Congratulations to all the authors! See Posters Sessions Tue-Thur in Great Hall & B1-B2 (level 1).

Outstanding Main Track Papers

Privacy Auditing with One (1) Training Run Authors: Thomas Steinke · Milad Nasr · Matthew Jagielski

Poster session 2: Tue 12 Dec 5:15 p.m. — 7:15 p.m. CST, #1523

Oral: Tue 12 Dec 3:40 p.m. — 4:40 p.m. CST, Room R06-R09 (level 2)

Abstract: We propose a scheme for auditing differentially private machine learning systems with a single training run. This exploits the parallelism of being able to add or remove multiple training examples independently. We analyze this using the connection between differential privacy and statistical generalization, which avoids the cost of group privacy. Our auditing scheme requires minimal assumptions about the algorithm and can be applied in the black-box or white-box setting. We demonstrate the effectiveness of our framework by applying it to DP-SGD, where we can achieve meaningful empirical privacy lower bounds by training only one model. In contrast, standard methods would require training hundreds of models.

Are Emergent Abilities of Large Language Models a Mirage? Authors: Rylan Schaeffer · Brando Miranda · Sanmi Koyejo

Poster session 6: Thu 14 Dec 5:00 p.m. — 7:00 p.m. CST, #1108

Oral: Thu 14 Dec 3:20 p.m. — 3:35 p.m. CST, Hall C2 (level 1) 

Abstract: Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability , appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

Outstanding Main Track Runner-Ups

Scaling Data-Constrained Language Models Authors : Niklas Muennighoff · Alexander Rush · Boaz Barak · Teven Le Scao · Nouamane Tazi · Aleksandra Piktus · Sampo Pyysalo · Thomas Wolf · Colin Raffel

Poster session 2: Tue 12 Dec 5:15 p.m. — 7:15 p.m. CST, #813

Oral: Tue 12 Dec 3:40 p.m. — 4:40 p.m. CST, Hall C2 (level 1)  

Abstract : The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations .

Direct Preference Optimization: Your Language Model is Secretly a Reward Model Authors: Rafael Rafailov · Archit Sharma · Eric Mitchell · Christopher D Manning · Stefano Ermon · Chelsea Finn

Poster session 6: Thu 14 Dec 5:00 p.m. — 7:00 p.m. CST, #625

Oral: Thu 14 Dec 3:50 p.m. — 4:05 p.m. CST, Ballroom A-C (level 2)  

Abstract: While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF’s ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Outstanding Datasets and Benchmarks Papers

In the dataset category : 

ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation

Authors:  Sungduk Yu · Walter Hannah · Liran Peng · Jerry Lin · Mohamed Aziz Bhouri · Ritwik Gupta · Björn Lütjens · Justus C. Will · Gunnar Behrens · Julius Busecke · Nora Loose · Charles Stern · Tom Beucler · Bryce Harrop · Benjamin Hillman · Andrea Jenney · Savannah L. Ferretti · Nana Liu · Animashree Anandkumar · Noah Brenowitz · Veronika Eyring · Nicholas Geneva · Pierre Gentine · Stephan Mandt · Jaideep Pathak · Akshay Subramaniam · Carl Vondrick · Rose Yu · Laure Zanna · Tian Zheng · Ryan Abernathey · Fiaz Ahmed · David Bader · Pierre Baldi · Elizabeth Barnes · Christopher Bretherton · Peter Caldwell · Wayne Chuang · Yilun Han · YU HUANG · Fernando Iglesias-Suarez · Sanket Jantre · Karthik Kashinath · Marat Khairoutdinov · Thorsten Kurth · Nicholas Lutsko · Po-Lun Ma · Griffin Mooers · J. David Neelin · David Randall · Sara Shamekh · Mark Taylor · Nathan Urban · Janni Yuval · Guang Zhang · Mike Pritchard

Poster session 4: Wed 13 Dec 5:00 p.m. — 7:00 p.m. CST, #105 

Oral: Wed 13 Dec 3:45 p.m. — 4:00 p.m. CST, Ballroom A-C (level 2)

Abstract: Modern climate projections lack adequate spatial and temporal resolution due to computational constraints. A consequence is inaccurate and imprecise predictions of critical processes such as storms. Hybrid methods that combine physics with machine learning (ML) have introduced a new generation of higher fidelity climate simulators that can sidestep Moore’s Law by outsourcing compute-hungry, short, high-resolution simulations to ML emulators. However, this hybrid ML-physics simulation approach requires domain-specific treatment and has been inaccessible to ML experts because of lack of training data and relevant, easy-to-use workflows. We present ClimSim, the largest-ever dataset designed for hybrid ML-physics research. It comprises multi-scale climate simulations, developed by a consortium of climate scientists and ML researchers. It consists of 5.7 billion pairs of multivariate input and output vectors that isolate the influence of locally-nested, high-resolution, high-fidelity physics on a host climate simulator’s macro-scale physical state. The dataset is global in coverage, spans multiple years at high sampling frequency, and is designed such that resulting emulators are compatible with downstream coupling into operational climate simulators. We implement a range of deterministic and stochastic regression baselines to highlight the ML challenges and their scoring. The data (https://huggingface.co/datasets/LEAP/ClimSim_high-res) and code (https://leap-stc.github.io/ClimSim) are released openly to support the development of hybrid ML-physics and high-fidelity climate simulations for the benefit of science and society.   

In the benchmark category :

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Authors: Boxin Wang · Weixin Chen · Hengzhi Pei · Chulin Xie · Mintong Kang · Chenhui Zhang · Chejian Xu · Zidi Xiong · Ritik Dutta · Rylan Schaeffer · Sang Truong · Simran Arora · Mantas Mazeika · Dan Hendrycks · Zinan Lin · Yu Cheng · Sanmi Koyejo · Dawn Song · Bo Li

Poster session 1: Tue 12 Dec 10:45 a.m. — 12:45 p.m. CST, #1618  

Oral: Tue 12 Dec 10:30 a.m. — 10:45 a.m. CST, Ballroom A-C (Level 2)

Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications to healthcare and finance – where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives – including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially due to the reason that GPT-4 follows the (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/.

Test of Time

This year, following the usual practice, we chose a NeurIPS paper from 10 years ago to receive the Test of Time Award, and “ Distributed Representations of Words and Phrases and their Compositionality ” by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, won. 

Published at NeurIPS 2013 and cited over 40,000 times, the work introduced the seminal word embedding technique word2vec. Demonstrating the power of learning from large amounts of unstructured text, the work catalyzed progress that marked the beginning of a new era in natural language processing.

Greg Corrado and Jeffrey Dean will be giving a talk about this work and related research on Tuesday, 12 Dec at 3:05 – 3:25 pm CST in Hall F.  

Related Posts

2023 Conference

Announcing NeurIPS 2023 Invited Talks

Reflections on the neurips 2023 ethics review process, neurips newsletter – november 2023.

  • Manage Account
  • Voter Guide
  • Solar Eclipse
  • Bleeding Out
  • Things to Do
  • Public Notices
  • Help Center

news Education

How are computers scoring STAAR essays? Texas superintendents, lawmaker want answers

Educators and legislators are concerned about transparency and a spike in high schoolers scoring zero points on written answers..

Texas superintendents want answers from the state education commissioner Mike Morath about...

By Talia Richman

11:10 AM on Feb 15, 2024 CST — Updated at 8:00 PM on Feb 15, 2024 CST

Texas superintendents — and at least one lawmaker — want answers from the state education commissioner about how computers are scoring STAAR essays.

The Texas Education Agency quietly debuted a new system for examining student answers on the State of Texas Assessments of Academic Readiness, or STAAR, in December . Roughly three-quarters of written responses are scored by a computer rather than a person.

“This is surprising news to me as a member of the House Public Education Committee, as I do not recall ever receiving notice of this novel and experimental method for grading high-stakes, STAAR tests,” Rep. Gina Hinojosa, D-Austin, wrote in a recent letter to Commissioner Mike Morath, which was also shared with The Dallas Morning News .

Superintendents across the state also were caught off guard until recently. Many school districts already are suing the state over changes to the academic accountability system that’s largely based on STAAR performance.

Receive our in-depth coverage of education issues and stories that affect North Texans.

By signing up you agree to our  Terms of Service  and  Privacy Policy

Related: Computers scoring Texas students’ STAAR essay answers, state officials say

The News reported on the rollout of computer scoring Wednesday.

The use of computers to score essays “was never communicated to school districts; yet this seems to be an unprecedented change that a ‘heads up’ would be reasonably warranted,” HD Chambers, director of the Texas School Alliance, wrote to Morath in a letter shared with The News .

TEA spokesman Jake Kobersky said in a statement that the agency is developing a comprehensive presentation for educators, explaining the changes in detail and addressing outstanding questions.

He added that the agency alerted the House Public Education Committee in August 2022 that it was pursuing automated scoring.

The final bulletpoint on an 18-page slideshow read: “TEA is pursuing automation for scoring where appropriate to reduce costs while ensuring reliability. Full human scoring is not possible under item-level computer-adaptive (B), and full human scoring with no automation under the current system would require at least $15-20M more per year.”

The new scoring method rolled out amid a broader STAAR redesign. The revamped test — which launched last year — has a cap on multiple choice questions and essays at every grade level. State officials say it would cost millions more to have only humans score the test.

The “automated scoring engines” are programmed to emulate how humans would assess an essay, and they don’t learn beyond a single question. The computer determines how to score written answers after analyzing thousands of students’ responses that were previously scored by people.

Among the district leaders’ biggest concerns is a huge spike in low scores among high schoolers under the new system.

Roughly eight in 10 written responses on the most recent English II End of Course exam received zero points this fall.

For the spring test — the first iteration of the redesigned test, but scored only by humans — roughly a quarter of responses scored zero points in the same subject.

Members of the Texas School Alliance , which represents 46 districts, “examined their individual district results and found shockingly consistent scoring differences.”

Chris Rozunick, the director of the state’s assessment development division, previously told The News that she understands why people connect the spike in zeroes to the rollout of automated scoring based on the timing. But she insists that the two are unrelated.

Many students who take STAAR in the fall are “re-testers” who did not meet grade level on a previous test attempt. Spring testers tend to perform better, according to agency officials who were asked to explain the spike in low scores in the fall.

“It really is the population of testers much more than anything else,” Rozunick said.

Kobersky added that, under the previous STAAR design, a score of zero was reserved for “unscorable responses,” meaning the question was left blank or written in a nonsensical way. The redesigned test rubric allows for a zero both if a response is unscorable or if it’s the value of the response as determined by the scorer, he said.

Some district leaders requested the state education agency provide them images of students’ responses so that they could “better understand what led to the significant increase in the number of zeroes, and most importantly how to help students write their responses” to receive better scores.

“Each request has been denied,” Chambers wrote in his letter to Morath.

Kobersky said fall questions are not released because they can be reused for other tests.

TEA officials say a technical report, with a detailed overview of the system, will be available later this year.

STAAR scores are of tremendous importance to district leaders, families and communities. Schools are graded on the state’s academic accountability system largely based on how students perform on these standardized tests.

Related: What are Texas’ A-F school grades, and why do they matter?

“As with all aspects of the STAAR test and the A-F accountability system, it is important that there is transparency, accuracy and fairness in these high-stakes results,” Hinojosa wrote.

The DMN Education Lab deepens the coverage and conversation about urgent education issues critical to the future of North Texas.

The DMN Education Lab is a community-funded journalism initiative, with support from Bobby and Lottye Lyle, Communities Foundation of Texas, The Dallas Foundation, Dallas Regional Chamber, Deedie Rose, Garrett and Cecilia Boone, The Meadows Foundation, The Murrell Foundation, Solutions Journalism Network, Southern Methodist University, Sydney Smith Hicks and the University of Texas at Dallas. The Dallas Morning News retains full editorial control of the Education Lab’s journalism.

Talia Richman

Talia Richman , Staff writer . Talia is a reporter for The Dallas Morning News Education Lab. A Dallas native, she attended Richardson High School and graduated from the University of Maryland. She previously covered schools and City Hall for The Baltimore Sun.

Dallas-Fort Worth home to some of the worst drivers in the U.S., per study

Mavericks appear poised for strong season finish, unlike last year’s post-all-star swoon, 1 dead, 1 injured in red bird shooting; no arrests made, police say, veteran ot tyron smith isn’t thinking of retirement, wants cowboys return for 2024, capital one to buy discover for $35 billion in deal combining u.s. credit card giants.

Is the University of Michigan Is Selling Student Data to Train AI?

It is unclear whether those included in the datasets consented to having their audio and texts used in such a manner.

Tony Ho Tran

Tony Ho Tran

Senior Editor, Innovation & Tech

The University of Michigan M is seen at the Michigan football stadium in Ann Arbor, Michigan, U.S., September 18, 2018.

REUTERS/Rebecca Cook

The University of Michigan has allegedly sold 85 hours of audio recordings from various academic settings including lectures, interviews, office hours, study groups, and student presentations to third parties for the purposes of training artificial intelligence . The school has also allegedly sold a dataset of 829 academic papers from students to help fine tune large language models (LLMs) as well.

It is unclear whether those included in the data consented to having their audio and texts used in such a manner. However, a sample dataset downloaded by The Daily Beast included a recording of a lecture from 1999 making it highly unlikely that they knew their data would be used to train future generative AI models.

AI engineer Susan Zhang took to X to post a screenshot showing what looks to be an advertisement from Catalyst Research Alliance, a firm selling the UM data, that she recently received on LinkedIn. The sender wrote that they were “reaching out because, based on your profile, you may be working with” LLMs.

“I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning LLMs,” the user wrote.

“So I guess this is a thing now,” Zhang said. “Universities running ads to resell students data for training LLMs.”

In a statement to The Daily Beast, UM spokesperson Colleen Mastony said that the ad was "sent out by a new third party vendor that shared inaccurate information and has since been asked to halt their work."

"Student data was not and has never been for sale by the University of Michigan," Mastony said. She added that the papers and speech recordings were "voluntarily contributed by student volunteers" who participated in two research studies "under signed consent." One study occurred between 1997 and 2000, while the other occurred between 2006 and 2007.

However, the nature of UM's relationship with Catalyst Research Alliance as a "third party vendor" was still unclear. Whether or not the students knew that their data would later be sold to help train AI was also not clear. Catalyst Research Alliance did not respond when reached for comment.

According to the firm's website, the cost of licensing the datasets varies depending on whether or not customers want to purchase just the audio recordings or the papers as well. However, the price goes as high as $25,000 for both datasets.

“The University of Michigan has recorded 65 speech events from a wide range of academic settings, including lectures, discussion sections, interviews, office hours, study groups, seminars and student presentations,” Catalyst Research Alliance said on its website . “Speakers represent broad demographics, including male and female and native and non-native English speakers from a wide variety of academic disciplines.”

The sample dataset included an audio lecture titled “Graduate Cellular Biotechnology Lecture” dated Feb. 1, 1999. In it, the unidentified lecturer speaks for roughly an hour and a half. The dataset also included a .txt file of a paper titled “The Democratic Inadequacies of the European Union.”

If true, the licensing deal is just another example of how personal data is being packaged and sold to help fuel emerging technologies such as generative AI. Even students whose work is completely unrelated to AI and LLMs can find their voice and writings being used in order to help train them.

“The whole thing feels deeply unethical,” Charles Logan, a learning sciences PhD candidate at Northwestern University, told The Daily Beast. Logan saw Zhang’s post on X and also commented on the situation, decrying it as the “logical progression of data capitalism.”

“When students are in a class or attending office hours there’s a trust implicit in that relationship,” Logan said. “They’re there to learn.”

He added that even if they are consenting to be a part of these datasets “there are still ways that they’re leaky.” “Private companies are monetizing student intellectual property and conversations that, if you’re in office hours or study groups, are deeply personal.”

Brett Karlan, an assistant professor of philosophy who researchers AI ethics, told The Daily Beast in an email that there's an "ethical tension" between the role of the university and its alleged sale of student data. "Students, ideally, should feel that they are able to try things out and make mistakes in classes, with those mistakes not being available to anyone except the professor," he explained. He added that now the material might end up training AI. "This feels like, if not a fundamental betrayal of the mission of a university."

That said, there is some room for doubt. Mastony said that the papers and recordings have "long been available for free to academics" and have "been used as a tool to improve writing and articulation in education. She added that "none of the papers or recordings included identifying information, such as names or other personal data."

“My first reaction is one of skepticism,” Vincent Conitzer, an AI ethics researcher at Carnegie Mellon University, told The Daily Beast. “Also, even taking this message mostly at face value, I suppose it may just all be based on recordings and papers that are anyway in the public domain.”

He added that “it seems odd to me to imagine the university at the highest levels standing behind something like what this message is suggesting.”

Got a tip? Send it to The Daily Beast  here .

READ THIS LIST

Search code, repositories, users, issues, pull requests...

Provide feedback.

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly.

To see all available qualifiers, see our documentation .

  • Notifications

A Chinese argumentative student essay dataset for Organization Evaluation and Discourse Element Identification

cnunlp/Chinese-Essay-Dataset-For-Organization-Evaluation

Folders and files, repository files navigation, chinese essay dataset for organization evaluation.

A Chinese argumentative student essay dataset for Organization Evaluation and Sentence Function Identification.

Introduction

We built a Chinese argumentative student essay dataset for Organization Evaluation and Sentence Function Identification.

Task Definition

  • Organization Evaluation(OE) . Evaluating the organization of argumentative student essays.
  • Sentence Function Identification(SFI) . Identifying the function of sentence in argumentative student essays.
  • Paragragh Function Identification(PFI) . Identifying the function of paragragh in argumentative student essays.

Discourse Elements

The concept of discourse elements is borrowed from [Burstein et al., 2003] and indicates the functions of sentences. In this paper, we refer discourse elements as both sentence functions and paragraph functions.

Sentence Functions. We mainly follow the definition and taxonomy proposed by [Burstein et al., 2003] except that we divide support into evidence and elaboration to give more details. The sentence functions include:

   Introduction is to introduce the background or attract readers’ attention before making claims.

   Thesis expresses the central claim of the writer with respect to the essays topic.

   Main Idea asserts foundational ideas or aspects that are related to the thesis.

   Evidence indicates examples and other types of evidence that are used to support the main ideas and thesis.

   Elaboration further explains the main ideas or evidence, but contains no evidence.

   Conclusion summarizes the full essay and echos or extends the central claim.

Paragraph Functions. The function of a paragraph is determined according to the functions of its sentences. We consider the following paragraph functions:

   IntroductionPara contains introduction sentences but does not have thesis or main idea sentences.

   ThesisPara contains at least a thesis sentence.

   IdeaPara contains at least a main idea sentence but does not have a thesis sentence.

   SupportPara contains evidence or elaboration sentences but does not contain thesis, main idea or conclusion sentences.

   ConclusionPara contains conclusion sentences but does not have thesis sentences.

Organization Grades

We represent organization quality with three grades.

   Bad The essay is poorly structured. It is incomplete or misses key discourse elements.

   Medium The essay is well structured and complete, but could be further improved.

   Great The essay is fairly well structured and the organization is very clear and logical.

Basic Statistics

Dataset file.

The description of keys of json data in all_data.json, as follows:

An example code of loading dataset as follows:

The dataset is released with this paper:

The references:

[1] Jill Burstein, Daniel Marcu, and Kevin Knight. 2003. Finding the write stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems, 18(1):32–39.

  • Python 100.0%
  • Newsletters
  • Account Activating this button will toggle the display of additional content Account Sign out

The Loss of Things I Took for Granted

Ten years into my college teaching career, students stopped being able to read effectively..

Recent years have seen successive waves of book bans in Republican-controlled states, aimed at pulling any text with “woke” themes from classrooms and library shelves. Though the results sometimes seem farcical, as with the banning of Art Spiegelman’s Maus due to its inclusion of “cuss words” and explicit rodent nudity, the book-banning agenda is no laughing matter. Motivated by bigotry, it has already done demonstrable harm and promises to do more. But at the same time, the appropriate response is, in principle, simple. Named individuals have advanced explicit policies with clear goals and outcomes, and we can replace those individuals with people who want to reverse those policies. That is already beginning to happen in many places, and I hope those successes will continue until every banned book is restored.

If and when that happens, however, we will not be able to declare victory quite yet. Defeating the open conspiracy to deprive students of physical access to books will do little to counteract the more diffuse confluence of forces that are depriving students of the skills needed to meaningfully engage with those books in the first place. As a college educator, I am confronted daily with the results of that conspiracy-without-conspirators. I have been teaching in small liberal arts colleges for over 15 years now, and in the past five years, it’s as though someone flipped a switch. For most of my career, I assigned around 30 pages of reading per class meeting as a baseline expectation—sometimes scaling up for purely expository readings or pulling back for more difficult texts. (No human being can read 30 pages of Hegel in one sitting, for example.) Now students are intimidated by anything over 10 pages and seem to walk away from readings of as little as 20 pages with no real understanding. Even smart and motivated students struggle to do more with written texts than extract decontextualized take-aways. Considerable class time is taken up simply establishing what happened in a story or the basic steps of an argument—skills I used to be able to take for granted.

Since this development very directly affects my ability to do my job as I understand it, I talk about it a lot. And when I talk about it with nonacademics, certain predictable responses inevitably arise, all questioning the reality of the trend I describe. Hasn’t every generation felt that the younger cohort is going to hell in a handbasket? Haven’t professors always complained that educators at earlier levels are not adequately equipping their students? And haven’t students from time immemorial skipped the readings?

The response of my fellow academics, however, reassures me that I’m not simply indulging in intergenerational grousing. Anecdotally, I have literally never met a professor who did not share my experience. Professors are also discussing the issue in academic trade publications , from a variety of perspectives. What we almost all seem to agree on is that we are facing new obstacles in structuring and delivering our courses, requiring us to ratchet down expectations in the face of a ratcheting down of preparation. Yes, there were always students who skipped the readings, but we are in new territory when even highly motivated honors students struggle to grasp the basic argument of a 20-page article. Yes, professors never feel satisfied that high school teachers have done enough, but not every generation of professors has had to deal with the fallout of No Child Left Behind and Common Core. Finally, yes, every generation thinks the younger generation is failing to make the grade— except for the current cohort of professors, who are by and large more invested in their students’ success and mental health and more responsive to student needs than any group of educators in human history. We are not complaining about our students. We are complaining about what has been taken from them.

If we ask what has caused this change, there are some obvious culprits. The first is the same thing that has taken away almost everyone’s ability to focus—the ubiquitous smartphone. Even as a career academic who studies the Quran in Arabic for fun, I have noticed my reading endurance flagging. I once found myself boasting at a faculty meeting that I had read through my entire hourlong train ride without looking at my phone. My colleagues agreed this was a major feat, one they had not achieved recently. Even if I rarely attain that high level of focus, though, I am able to “turn it on” when demanded, for instance to plow through a big novel during a holiday break. That’s because I was able to develop and practice those skills of extended concentration and attentive reading before the intervention of the smartphone. For children who were raised with smartphones, by contrast, that foundation is missing. It is probably no coincidence that the iPhone itself, originally released in 2007, is approaching college age, meaning that professors are increasingly dealing with students who would have become addicted to the dopamine hit of the omnipresent screen long before they were introduced to the more subtle pleasures of the page.

The second go-to explanation is the massive disruption of school closures during COVID-19. There is still some debate about the necessity of those measures, but what is not up for debate any longer is the very real learning loss that students suffered at every level. The impact will inevitably continue to be felt for the next decade or more, until the last cohort affected by the mass “pivot to online” finally graduates. I doubt that the pandemic closures were the decisive factor in themselves, however. Not only did the marked decline in reading resilience start before the pandemic, but the students I am seeing would have already been in high school during the school closures. Hence they would be better equipped to get something out of the online format and, more importantly, their basic reading competence would have already been established.

Less discussed than these broader cultural trends over which educators have little control are the major changes in reading pedagogy that have occurred in recent decades—some motivated by the ever-increasing demand to “teach to the test” and some by fads coming out of schools of education. In the latter category is the widely discussed decline in phonics education in favor of the “balanced literacy” approach advocated by education expert Lucy Calkins (who has more recently come to accept the need for more phonics instruction). I started to see the results of this ill-advised change several years ago, when students abruptly stopped attempting to sound out unfamiliar words and instead paused until they recognized the whole word as a unit. (In a recent class session, a smart, capable student was caught short by the word circumstances when reading a text out loud.) The result of this vibes-based literacy is that students never attain genuine fluency in reading. Even aside from the impact of smartphones, their experience of reading is constantly interrupted by their intentionally cultivated inability to process unfamiliar words.

For all the flaws of the balanced literacy method, it was presumably implemented by people who thought it would help. It is hard to see a similar motivation in the growing trend toward assigning students only the kind of short passages that can be included in a standardized test. Due in part to changes driven by the infamous Common Core standards , teachers now have to fight to assign their students longer readings, much less entire books, because those activities won’t feed directly into students getting higher test scores, which leads to schools getting more funding. The emphasis on standardized tests was always a distraction at best, but we have reached the point where it is actively cannibalizing students’ educational experience—an outcome no one intended or planned, and for which there is no possible justification.

We can’t go back in time and do the pandemic differently at this point, nor is there any realistic path to putting the smartphone genie back in the bottle. (Though I will note that we as a society do at least attempt to keep other addictive products out of the hands of children.) But I have to think that we can, at the very least, stop actively preventing young people from developing the ability to follow extended narratives and arguments in the classroom. Regardless of their profession or ultimate educational level, they will need those skills. The world is a complicated place. People—their histories and identities, their institutions and work processes, their fears and desires—are simply too complex to be captured in a worksheet with a paragraph and some reading comprehension questions. Large-scale prose writing is the best medium we have for capturing that complexity, and the education system should not be in the business of keeping students from learning how to engage effectively with it.

This is a matter not of snobbery, but of basic justice. I recognize that not everyone centers their lives on books as much as a humanities professor does. I think they’re missing out, but they’re adults and they can choose how to spend their time. What’s happening with the current generation is not that they are simply choosing TikTok over Jane Austen. They are being deprived of the ability to choose—for no real reason or benefit. We can and must stop perpetrating this crime on our young people.

comscore beacon

Help | Advanced Search

Computer Science > Computation and Language

Title: large language models: a survey.

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

Submission history

Access paper:.

  • Download PDF
  • HTML (experimental)
  • Other Formats

license icon

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

  • International edition
  • Australia edition
  • Europe edition

Student in a library

The software says my student cheated using AI. They say they’re innocent. Who do I believe?

In the desperate scramble to combat AI, there is a real danger of penalising students who have done nothing wrong

  • Robert Topinka a senior lecturer in media and cultural studies at Birkbeck, University of London

W hen I sat down to mark undergraduate student essays in the spring of 2023, the hype around ChatGPT was already at giddy heights. Like teachers everywhere, I was worried that students would succumb to the temptation to outsource their thinking to the machine. Many universities, including mine, responded by adopting AI detection software, and I soon had my fears confirmed when it provided the following judgment on one of the essays: “100% AI-generated”.

Essays are marked anonymously, so my heart dropped when I found out that the first “100% AI-generated” essay I marked belonged to a brilliant, incisive thinker whose essays in the pre-ChatGPT era were consistently excellent, if somewhat formulaic in style.

I found myself in an increasingly common predicament, caught between software products and humans: students and ChatGPT on one side, lecturers and AI detectors on the other. Policy demands that I refer essays with high AI detection scores for academic misconduct, something that can lead to steep penalties, including expulsion. But my standout student contested the referral, claiming university-approved support software they used for spelling and grammar included limited generative AI capabilities that had been mistaken for ChatGPT.

The software that scanned my student’s essay is provided by Turnitin, an American “education technology” giant that is one of the biggest players in the academic misconduct market. Before ChatGPT, Turnitin’s primary function was to produce “similarity reports” by checking essays against a database of websites and previously submitted student work. A high similarity score does not always mean plagiarism – some students just quote abundantly – but does make it easier to find copy-and-paste jobs.

Generative AI makes copying and pasting seem old-fashioned. Prompted with an essay question, ChatGPT produces word combinations that won’t show up in a similarity report. Facing a threat to its business model, Turnitin has responded with an AI detection software that measures whether an essay strings words together in predictable patterns – as ChatGPT does – or in the more idiosyncratic style of a human. But the tool is not definitive: while the label announces that an essay is “X% AI-generated”, a link in fine print below the percentage opens a disclaimer that admits it only “might be”.

Unlike the “similarity report”, which includes links to sources so that lecturers can verify whether a student plagiarised or used too many quotations, the AI detection software is a black box. ChatGPT has more than 180 million monthly users, and it produces different – if formulaic – text for all of them. There is no reliable way to reproduce the same text for the same prompt, let alone to know how students might prompt it. Students and lecturers are caught in an AI guessing game. It’s not hard to find students sharing tips online about evading AI detection with paraphrasing tools and AI “humanisers” . It’s also not hard to find desperate students asking how to beat false accusations based on unreliable AI detection.

When my student contested the AI detector’s judgment, I granted the appeal. I admit to trusting the human over the machine. But the defence was also convincing, and this particular student had been consistently writing in this style long before ChatGPT came into being. Still, I was making a high-stakes call without reliable evidence. It was a distressing experience for my student, and one that is being repeated across the sector.

Many academics have translated the hype around AI to a heightened suspicion of students. And it’s true that ChatGPT can plausibly write mediocre university-level essays. A combination of ChatGPT and AI “humanisers” might even carry someone through university with a 2:2.

But if universities treat this as an arms race, it will inevitably harm students who rely on additional support to survive a system that is overwhelmingly biased to white, middle-class, native English speakers without disabilities, and whose parents went to university. Students who don’t fall into those categories are also more likely to turn for support to spelling and grammar checkers like Grammarly, which also uses generative AI to offer stylistic suggestions, putting them at risk of running foul of AI detectors even when the substantive ideas are original. Innocent students will inevitably find themselves in a kind of Kafkaesque computational scenario – accused by one automated software of improperly relying on another.

What is to be done? In the desperate – and largely futile – scramble to “catch up” with AI, there is a real danger that academics lose sight of why we assign essays in the first place: to give students the opportunity to display their ability to evaluate information, think critically and present original arguments. This may even be an opportunity to move away from the sort of conventional essay questions that can so easily be fed into ChatGPT. Students can present original, critical work in presentations, podcasts, videos and reflective writing.

It’s also possible to ask students questions that include information that doesn’t exist in ChatGPT’s training data – for instance, by incorporating content generated in class. Lecturers could also address ambient AI anxiety head-on by prompting ChatGPT with assigned essay questions and asking students to critique the resulting output in class. The goal need not be just to fend off AI-generated essays: expanding the range of assessments can also help universities close the achievement gap that exists in part because traditional forms of assessment tend to favour more privileged students.

Of course, all of this puts more pressure on casualised, overworked staff, which is why the kneejerk response to revert to closed-door, handwritten exams is understandable, if misguided. We can be critical of AI, but we can’t pretend it doesn’t exist if we want to prepare students for a world where humans will have to live and work alongside thinking machines.

Achieving that goal would be easier if students arrived at university as open-minded critical thinkers instead of stressed-out, debt-burdened consumers. In this sense, the panic around AI is only the latest symptom of a broader crisis at UK universities, and it is a crisis that is not equally felt. The Conservative government’s move to force universities to cap admissions on “low-value” degrees – cutting off their primary funding source – is also an attack on working-class and minority-ethnic students. Responding to AI with punitive measures based on unreliable detection software risks contributing to that same attack. If there is any chance of avoiding the bitter irony of the biggest technological breakthrough since the internet entrenching enduring inequalities in education, it will come from lecturers working with AI, instead of joining a losing fight against it.

Robert Topinka is a senior lecturer in media and cultural studies at Birkbeck, University of London

  • Artificial intelligence (AI)
  • Universities
  • Higher education

Most viewed

  • Share full article

Advertisement

Supported by

Guest Essay

Every FAFSA Delay Puts College Further Out of Reach

An illustration of a figure wearing a varsity jacket emblazoned with the words “Insufficient Funds.”

By Justin Draeger and Ted Mitchell

Mr. Draeger is president and C.E.O. of the National Association of Student Financial Aid Administrators. Mr. Mitchell is president of the American Council on Education and a former U.S. under secretary of education.

Each year, more than 17 million students complete the Free Application for Federal Student Aid, or FAFSA, hoping to secure the financial support they need to afford college. But this year, operational glitches and repeated delays in the U.S. Department of Education’s “Better FAFSA” rollout threaten to harm the very students and families that financial aid is intended to help. Despite promises of an easier, more straightforward application process, students and families so far have been met with glitches and delays, and still today, there are entire groups of students blocked from even completing the form.

The department notified schools on Jan. 30, the day that they were supposed to get detailed information to determine how much aid was available for each student, that schools would not receive that data until sometime in the first half of March, leaving colleges scrambling to determine how best to issue aid offers as soon as possible. Students may not receive financial aid offers until April and are typically expected to make a decision about where to attend college by May 1. The traditional “college decision day” simply may not work for students this year since many will not have had the time they need to consider all of their financial options.

This problematic rollout is causing more than just an administrative headache. For students — and even schools themselves — the ripple effect could be catastrophic. Federal financial aid programs were created to open the doors to higher education, bringing a dream within reach for some who would otherwise be unable to unlock that future. Those who can least afford to pay for college will be the most adversely affected.

Some students may be pressured into making one of the most significant financial decisions of their lives without having a complete picture of their options. Others may delay enrolling in college for another year, once the aid application process is running more smoothly. Or, worse yet, some students may become so frustrated by the complexity and confusion of this year’s financial aid process that they give up altogether, forgoing pursuit of a postsecondary degree or credential that would add to their earnings and provide all the other benefits that come with additional education.

Meanwhile, many colleges and universities are stuck in a holding pattern. Schools, state agencies and private scholarship providers rely on FAFSA data to determine how to distribute their own financial aid dollars. Without that information, financial aid offices can’t begin the work of putting together aid offers for students or even precise timelines about when students will receive them.

And without those aid offers, students can’t — really, shouldn’t — decide where to enroll. Financial aid offices are feeling pressure from students and families who are rightfully frustrated and confused as to why they haven’t been given any information on aid packages, as well as from institutional leaders who are eager to finalize their incoming class and budget for the year ahead.

Colleges and universities must now move to take corrective action, and fast. That is why we and other higher education association leaders are urging schools to extend financial aid and enrollment deadlines beyond the traditional May 1 date.

For years, students, families, college guidance and admissions professionals and researchers have known that the FAFSA was too complicated, lengthy and daunting, causing many qualified students to skip filling out the form and miss out on the aid they’re entitled to. In 2020, Congress ordered the Department of Education to overhaul the form by asking fewer questions and relying on technology to obtain key information already gathered by other federal agencies, such as the I.R.S.

Congress provided no additional funds to help roll out a new FAFSA. At the same time, Federal Student Aid, the office in the Department of Education responsible for the FAFSA, was working to revamp the student lending system while creating numerous new loan forgiveness efforts, including the expansive plan that was ultimately blocked by the Supreme Court last summer. Big ambitions and limited resources most likely contributed to the problems we’re seeing now.

Given these challenges, Congress granted the department an extra year to pull off this huge system overhaul. But even with a three-year development runway, when the 2024-25 FAFSA finally “soft launched” nearly three months later than usual on Dec. 30, 2023, it did so with very limited availability: less than an hour a day for the first couple of days. While the form is now available 24/7 and more than three million students have been able to complete it, some applicants in special family circumstances and those who make simple mistakes on the form still cannot log back in to correct and resubmit.

Moving forward, the Department of Education must meet its own timelines, putting aside blame and finger-pointing to provide the higher education community with better, continuing and more proactive communication about the FAFSA rollout. Colleges, financial aid offices, high school guidance counselors and millions of students simply cannot make plans around last-minute delays and surprises.

The responsibility to make sure that students and families get the information they need, when they need it, in time to make educated decisions about college cannot be delayed.

Justin Draeger is president and C.E.O. of the National Association of Student Financial Aid Administrators. Ted Mitchell is president of the American Council on Education and a former U.S. under secretary of education.

The Times is committed to publishing a diversity of letters to the editor. We’d like to hear what you think about this or any of our articles. Here are some tips . And here’s our email: [email protected] .

Follow the New York Times Opinion section on Facebook , Instagram , TikTok , X and Threads .

IMAGES

  1. Example of 12 tuples of student dataset

    student essay dataset

  2. Sample student dataset generated by automatic exam system

    student essay dataset

  3. Sample of student dataset [4].

    student essay dataset

  4. Data Analysis Between Students Essay (300 Words)

    student essay dataset

  5. Automated Essay Scoring Dataset

    student essay dataset

  6. [PDF] Automated Essay Grading Using Machine Learning

    student essay dataset

COMMENTS

  1. ASAP Dataset

    Dataset Loaders Edit No data loaders found. You can submit your data loader here. Tasks Edit Automated Essay Scoring Unknown Modalities Edit Texts Languages Edit English There are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response.

  2. The Hewlett Foundation: Automated Essay Scoring

    Develop an automated scoring algorithm for student-written essays.

  3. Explainable Automated Essay Scoring: Deep Learning Really Has

    Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize's essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions.

  4. Automated Essay Scoring and the Deep Learning Black Box: How ...

    This article: a) analyzes the datasets from the contest - which contained hand-graded essays - to measure their suitability for developing competent automated grading tools; b) evaluates the potential for deep learning in automated essay scoring (AES) to produce sophisticated testing and grading algorithms; c) advocates for thorough and transpar...

  5. Deep Learning in Automated Essay Scoring

    It uses the essay dataset #8 from the Automated Student Assessment Prize competition, hosted by the Kaggle platform, and a state-of-the-art Suite of Automatic Linguistic Analysis Tools (SALAT) to extract 1,463 writing features. A non-linear regressor deep neural network is trained to predict holistic scores on a scale of 10-60.

  6. Datasets for Automatic Scoring of Educational Text Data

    This page provides an overview of freely available automatic free-text scoring datasets from the areas of automatic essay grading and content/short-answer scoring. This page is maintained by CATALPA; FernUniversität in Hagen, Germany. Did we miss your dataset? Is there an error in how we represented your data in the table?

  7. Essay Scoring Datasets

    Automated essay grading is one of the most important educational applications of NLP. Given an essay, an automated essay grader's goal is to provide annotations with respect to various dimensions of essay quality such as coherence, argument strength, and technical quality, all without human intervention.

  8. An automated essay scoring systems: a systematic literature review

    All evaluating methods are not evaluated based on coherence, relevance, completeness, feedback, and knowledge-based. And 90% of essay grading systems are used Kaggle ASAP (2012) dataset, which has general essays from students and not required any domain knowledge, so there is a need for domain-specific essay datasets to train and test.

  9. Score It All Together: A Multi-Task Learning Study on Automatic Scoring

    On the recently published student essay dataset PERSUADE, we first show that the automatic scoring of argument quality benefits from additional information about context, writing prompt and argument type. We then explore the different combinations of three tasks: automated span detection, type and quality prediction.

  10. PDF Automated essay scoring using efficient transformer-based language models

    XLNet have been shown to be very solid on the Kaggle dataset [33, 47]. To our knowledge, combining BERTwith hand-crafted features form the current state-of-the-art [40]. The recent works of have challenged the paradigm that bigger models are necessarily better [11, 21, 24, 29]. The models in these papers possess some fundamental architectural ...

  11. Automated Essay Scoring: Kaggle Competition

    Dataset: For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response.

  12. An automated essay scoring systems: a systematic literature review

    All evaluating methods are not evaluated based on coherence, relevance, completeness, feedback, and knowledge-based. And 90% of essay grading systems are used Kaggle ASAP (2012) dataset, which has general essays from students and not required any domain knowledge, so there is a need for domain-specific essay datasets to train and test.

  13. Automated Student Assessment and Essay Generator

    The Hewlett Foundation has provided a set of high school student essays along with scores generated by human expert graders. The initial data was released in 2012 1 as part of a Kaggle competition to produce an automated student assessment algorithm to closely match the human scores.

  14. Student Essay Dataset

    Student Essay is widely used in research on argument segmentation. Student Essay is widely used in research on argument segmentation. Browse State-of-the-Art ... The benchmarks section lists all benchmarks using a given dataset or any of its variants. We use variants to distinguish between results evaluated on slightly different versions of the ...

  15. PDF arXiv:2401.06431v1 [cs.CL] 12 Jan 2024

    essay scoring dataset for AES tasks. The dataset was carefully curated in collaboration with 29 high schools in China, encompassing a total of 6,559 student essays that were part of a final exam. To ensure privacy and confidentiality, all personal in-formation related to the students has been removed from the dataset. The evaluation of these ...

  16. PDF ASAP++: Enriching the ASAP Automated Essay Grading Dataset with Essay

    Automated Essay Grading (AEG) dataset to train and test systems3. The ASAP AEG dataset comprises of approxi-mately 13,000 essays, written across 8 prompts. The essays were written by students of class 7 to 10. Each essay was evaluated by 2 evaluators. 6 out of the 8 prompts only have overall scores. Only 2 of them have scores for individual

  17. A Bi-modal Automated Essay Scoring System for Handwritten Essays

    We evaluate VisualAES with the public Automated Student Assessment Prize (ASAP) dataset and our proposed handwritten Chinese Students' handwritten essay dataset (ChnStd). Results show that the proposed VisualAES outperforms all state-of-the-art methods on both datasets. More importantly, by incorporating handwritten image information, we also ...

  18. Hidding the Ghostwriters: An Adversarial Evaluation of

    This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset, employing a range of text perturbation methods that are expected to generate high-quality essays while evading detection. Through empirical experiments, we assess the performance of current AIGC detectors on the AIG-ASAP dataset.

  19. PDF GHOSTBUSTER: DETECTING TEXT GHOSTWRITTEN BY LARGE L MODELS

    In conjunction with our model, we release three new datasets of human- and AI-generated text as detection benchmarks in the domains of student essays, creative writing, and news articles. We compare Ghostbuster to a variety of existing detectors, including DetectGPT and GPTZero, as well as a new RoBERTa baseline.

  20. One-Class Learning for AI-Generated Essay Detection

    Results with two datasets containing essays in L2 English and L2 Spanish show that it is feasible to accurately detect AI-generated essays. The analysis reveals which models and which sets of linguistic features are more powerful than others in the detection task. Keywords: essay classification; one-class learning; linguistics; L2 writing; ChatGPT

  21. essays-dataset · GitHub Topics · GitHub

    essays-dataset Star Here are 2 public repositories matching this topic... amirmohammadkz / personality_detection Star 27 Code Issues Pull requests BB-SVM model for automatic personality detection of the essays dataset (Big-Five personality labeled traits)

  22. Announcing the NeurIPS 2023 Paper Awards

    Two Outstanding Datasets and Benchmark Track Papers This year's organizers received a record number of paper submissions. Of the 13,300 submitted papers that were reviewed by 968 Area Chairs, 98 senior area chairs, and 396 Ethics reviewers 3,540 were accepted after 502 papers were flagged for ethics reviews.

  23. How are computers scoring STAAR essays? Texas superintendents, lawmaker

    The use of computers to score essays "was never communicated to school districts; yet this seems to be an unprecedented change that a 'heads up' would be reasonably warranted," HD Chambers ...

  24. Is the University of Michigan Selling Student Data to Train AI?

    The school has also allegedly sold a dataset of 829 academic papers from students to help fine tune large language models (LLMs) as well.

  25. Chinese Essay Dataset For Organization Evaluation

    We built a Chinese argumentative student essay dataset for Organization Evaluation and Sentence Function Identification. Task Definition Organization Evaluation (OE). Evaluating the organization of argumentative student essays. Sentence Function Identification (SFI). Identifying the function of sentence in argumentative student essays.

  26. Literacy crisis in college students: Essay from a professor on students

    Yes, there were always students who skipped the readings, but we are in new territory when even highly motivated honors students struggle to grasp the basic argument of a 20-page article.

  27. [2402.06196] Large Language Models: A Survey

    Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \\cite{kaplan2020scaling ...

  28. The software says my student cheated using AI. They say they're

    Essays are marked anonymously, so my heart dropped when I found out that the first "100% AI-generated" essay I marked belonged to a brilliant, incisive thinker whose essays in the pre-ChatGPT ...

  29. Opinion

    Each year, more than 17 million students complete the Free Application for Federal Student Aid, or FAFSA, hoping to secure the financial support they need to afford college. But this year ...