

Technical Field
The invention belongs to the technical field of smart education and relates to a decision-tree-based method for academic performance prediction and personalized intervention.
Background Art
On an online education platform, the system records each learner's learning behavior in detail. Researchers collect these data and analyze them with educational big data techniques to obtain learner characteristics and provide personalized recommendations. Many factors link a learner to his or her academic performance, yet most current research on academic performance prediction in online education considers only the learner's behavioral characteristics. It ignores basic information such as the learner's academic background, family characteristics and state-change characteristics. The behavioral characteristics that are selected are also narrow and one-dimensional, and when they are used as test indicators, both the data types and the number of samples are very limited. Moreover, once the behavioral characteristics that affect academic performance have been predicted and an intervention is made, no personalized intervention plan is given according to the individual differences between learners.
Summary of the Invention
The purpose of the invention is to provide a decision-tree-based method for academic performance prediction and personalized intervention. It addresses two problems in the prior art: academic performance prediction considers only the learner's behavioral characteristics, and when the behavioral characteristics that affect academic performance are predicted and an intervention is made, no personalized intervention plan is given according to the individual differences between learners.
The technical solution adopted by the invention is a decision-tree-based method for academic performance prediction and personalized intervention, implemented according to the following steps:
Step 1: collect the learner's learning behavior data, comprising static data and dynamic data. The static data comprises one major category: the learner's basic information data. The dynamic data comprises four major categories: the learner's interface interaction data, content interaction data, test data and state change data on the adaptive platform;
Step 2: process the learning behavior data collected in step 1;
Step 3: quantify the dynamic data in the learning behavior data processed in step 2;
Step 4: select a part of the static data processed in step 2 and a part of the dynamic data quantified in step 3 as training set data, calculate the correlation between the data in the training set, and determine from the correlation the learning behavior variables to be used as academic performance predictors;
Step 5: keep the training data that corresponds to the learning behaviors determined in step 4 as academic performance predictors, remove the other training data to obtain an updated training set, and apply the decision tree algorithm to the updated training set to predict academic performance;
Step 6: following steps 4 and 5, select a part of the static data processed in step 2 and a part of the dynamic data quantified in step 3 as test set data, and generate the final decision tree prediction model to predict the academic performance result;
Step 7: evaluate the prediction result by precision, accuracy and recall; when any one of the precision, accuracy and recall is not less than 90%, the corresponding learning behavior indicator is taken as a learning behavior indicator that affects the learner's performance;
Step 8: perform K-Means clustering on the behavior indicators determined in step 7 to determine the learner groups;
Step 9: for the different learner groups determined in step 8, each of which shares the same learning behavior, provide a different learning plan for each group.
The invention is further characterized in that:
The basic information data in step 1 comprises the following subcategories: place of household registration, type of household registration, family situation, parents' education level, whether the learner is a boarding student, and whether the learner is a left-behind child;
The learner's interface interaction data on the adaptive platform in step 1 comprises the following subcategories: viewing test results, browsing materials, browsing announcements, browsing posts, browsing course information, browsing knowledge point mastery, and number of likes;
The learner's content interaction data on the adaptive platform in step 1 comprises the following subcategories: number of courseware downloads, number of video downloads, number of video views, number of test paper downloads, number of posts, number of replies, number of online answers, number of assignment submissions, number of test comments posted, and number of comments received;
The learner's test data on the adaptive platform in step 1 comprises the following subcategories: test score, test difficulty, knowledge point mastery, correct rate, answering time per question, test duration, and test type;
The learner's state change data on the adaptive platform in step 1 comprises the following subcategories: number of logins, system login duration, cumulative online duration, interval between online sessions, duration away from the platform, number of times leaving the platform, login time, and offline time.
Step 2 is specifically:
Step 2.1, data cleaning: check the data collected in step 1 against the reasonable value range of each variable and the relationships between variables to determine whether the learner data meets the requirements. Data that is out of the normal range, logically unreasonable or self-contradictory is checked and corrected. Invalid values and missing values in a learner's behavior data are handled as follows;
Learning behavior data that does not match reality is regarded as an invalid value;
If a learner's learning behavior data is incomplete, that is, it does not cover every category of learning behavior, the data that is present for that learner is treated as missing values and deleted;
Step 2.2: integrate the cleaned data;
Step 2.3: convert the integrated learning behavior data to string type, where converted floating-point values keep two decimal places.
Step 3 is specifically:
Step 3.1, quantification of the learner's interface interaction data on the adaptive platform:
The counts for viewing test results and for likes are obtained directly on the adaptive platform by accumulating the learner's clicks on viewing test results and on likes;
Browsing materials, browsing announcements, browsing posts, browsing course information and browsing knowledge point mastery are measured by how long the learner browses the corresponding resource. The time from the start to the end of one browsing session is the browsing duration of that session, and the total browsing duration is the sum of the durations of all sessions, as follows:
The learner's browsing duration for a resource is calculated according to the following formula:
$$S_{scan}=\sum_{i=1}^{t}\left(T_{leave}^{(i)}-T_{enter}^{(i)}\right)$$
where $S_{scan}$ is the learner's browsing duration for a resource, $T_{leave}^{(i)}$ and $T_{enter}^{(i)}$ are the leave time and entry time of the learner's $i$-th visit to the resource, and $t$ is the number of times the learner visited the resource;
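The accumulation of browsing time described above can be sketched in a few lines. This is a minimal illustration, assuming each visit is logged as a hypothetical `(enter, leave)` pair of timestamps in seconds; the function name and the sample values are not from the source.

```python
from typing import Iterable, Tuple


def total_browsing_duration(visits: Iterable[Tuple[float, float]]) -> float:
    """Sum (leave - enter) over every visit a learner made to one resource."""
    return sum(leave - enter for enter, leave in visits)


# Three hypothetical visits to one resource, as (T_enter, T_leave) in seconds.
visits = [(0.0, 30.0), (100.0, 145.5), (200.0, 260.0)]
s_scan = total_browsing_duration(visits)
print(s_scan)
```

The number of visits `t` is simply the length of the visit log, so it does not need to be passed separately.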
Step 3.2, quantification of the learner's content interaction data on the adaptive platform:
From the learner's learning records on the adaptive platform, the corresponding actions are summed directly to obtain the number of courseware downloads, number of video downloads, number of video views, number of test paper downloads, number of posts, number of replies, number of online answers, number of assignment submissions, number of test comments posted, and number of comments received;
Step 3.3, quantification of the learner's test data on the adaptive platform:
The correct rate is the proportion of questions answered correctly in one stage test. Define the set of questions on a test paper as $Q=\{q_1,q_2,q_3,\ldots,q_m\}$ and the set of wrongly answered questions as $E=\{e_1,e_2,e_3,\ldots,e_k\}$; the correct rate can then be expressed as:
$$T_{correct}=\frac{|Q|-|E|}{|Q|}$$
where $T_{correct}$ is the correct rate, and $|Q|$ and $|E|$ are the total number of questions and the number of wrongly answered questions;
The learner's knowledge point mastery is obtained as follows:
Under the corresponding knowledge point, select $x$ easy questions, $x$ medium questions and $x$ hard questions, where the difficulty value of an easy question lies in the range 0 to 0.3, that of a medium question in the range 0.4 to 0.7, and that of a hard question in the range 0.8 to 1.0. The mastery $W$ of the knowledge point is then calculated according to the following formula:
where $B$, $N$ and $H$ are the total difficulty values of the easy, medium and hard questions respectively, and the three sums in the formula are, respectively, the sum of the difficulty values of the easy questions answered correctly, the sum of the difficulty values of the medium questions answered correctly, and the sum of the difficulty values of the hard questions answered correctly;
The test difficulty is obtained by summing and averaging the difficulty of every question on a test paper. The difficulties of the individual questions on a test paper can be defined as $F=\{f_1,f_2,f_3,\ldots,f_M\}$, and the difficulty can then be expressed as:
where $F_{difficult}$ is the test difficulty, $\sum_{i=1}^{M} f_i$ is the sum of the difficulty values of all questions on the test paper, $M$ is the total number of questions on the test paper, and $E$ is the learner's excellence index;
The excellence index $E$ is calculated as follows:
where $a$ is the number of knowledge points the learner has studied and $p_i$ is the learner's mastery of knowledge point $p$, calculated according to formula (3);
The test score is obtained by summing the learner's scores on every question in the test;
The test type is obtained directly from the adaptive platform. There are two test types, the pre-learning test and the learning-status test; the pre-learning test is represented by 0 and the learning-status test by 1;
The answering time per question is the interval between the learner clicking into a question and clicking into the next question, and is obtained directly from the adaptive platform;
The test duration is the time spent on one test paper, expressed as $\sum_{i=1}^{M} T_i$, where $T_i$ is the answering time of the $i$-th question and $M$ is the total number of questions on the test paper;
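Two of the test quantities above are simple enough to sketch directly. The example assumes the correct rate is the share of questions not in the wrong-answer set, as the definitions of the question set and the wrong-answer set suggest, and that per-question answering times are available in seconds. The function names and sample data are hypothetical.

```python
def correct_rate(questions, wrong):
    """Share of questions answered correctly: (|Q| - |E|) / |Q|."""
    if not questions:
        raise ValueError("the test paper has no questions")
    return (len(questions) - len(wrong)) / len(questions)


def total_test_time(per_question_seconds):
    """Test duration as the sum of the answering times T_i of all M questions."""
    return sum(per_question_seconds)


questions = ["q1", "q2", "q3", "q4", "q5"]   # Q, with m = 5
wrong = ["q2", "q5"]                          # E, with k = 2
rate = correct_rate(questions, wrong)
duration = total_test_time([40, 65, 30, 90, 55])
print(rate, duration)
```

The mastery, difficulty and excellence-index formulas are not sketched here because their exact form is not recoverable from the text.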
Step 3.4, quantification of the learner's state change data on the adaptive platform:
The number of logins, system login duration, cumulative online duration, interval between online sessions, duration away from the platform, number of times leaving the platform, login time and offline time are obtained by accumulating the learner's access records on the platform.
Step 4 is specifically:
Step 4.1: put 80% of the static data processed in step 2 and 80% of the dynamic data quantified in step 3 into the training set as training data;
Step 4.2: calculate the Pearson correlation coefficient $r$ between any two sets of training data in the training set. For static data, the training data is the string value converted in step 2; for dynamic data, the training data is the value quantified in step 3. The coefficient is calculated according to the following formula:
$$r=\frac{\sum_{i=1}^{g}\left(Z_i-\bar{Z}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{g}\left(Z_i-\bar{Z}\right)^2}\,\sqrt{\sum_{i=1}^{g}\left(Y_i-\bar{Y}\right)^2}}$$
where $g$ is the total number of training data in the training set, $Z_i$ is one training datum of a certain type of learning behavior in the training set, $\bar{Z}$ is the mean of all training data of that type of learning behavior, $Y_i$ is one training datum of another type of learning behavior different from that of $Z_i$, and $\bar{Y}$ is the mean of all training data of that other type of learning behavior;
Step 4.3: if the calculated $r>0$ or $r<0$, that is, $r\neq 0$, the learning behaviors corresponding to $Z_i$ and $Y_i$ are both taken as learning behavior variables for the academic performance predictors.
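The correlation screening of step 4 can be illustrated as follows. This is a sketch under two assumptions that go beyond the text: every column has already been converted to numbers, and each behavior is compared only against a hypothetical `score` column rather than against every other column. A constant column, for which the coefficient is undefined, is treated as uncorrelated.

```python
import math


def pearson_r(z, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    if len(z) != len(y) or len(z) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    z_mean = sum(z) / len(z)
    y_mean = sum(y) / len(y)
    cov = sum((a - z_mean) * (b - y_mean) for a, b in zip(z, y))
    z_var = sum((a - z_mean) ** 2 for a in z)
    y_var = sum((b - y_mean) ** 2 for b in y)
    if z_var == 0 or y_var == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / math.sqrt(z_var * y_var)


def select_indicators(columns, target):
    """Keep every behavior whose correlation with the target is nonzero."""
    return [name for name, values in columns.items()
            if name != target and pearson_r(values, columns[target]) != 0.0]


# Hypothetical training columns: two behaviors and the score to be predicted.
columns = {
    "video_views": [1, 2, 3, 4],
    "logins": [5, 5, 5, 5],
    "score": [50, 60, 70, 80],
}
selected = select_indicators(columns, "score")
print(selected)
```

With real data a coefficient of exactly zero is rare, so a minimum value of |r| may be a more useful cut-off; the text does not specify one.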
In step 5, applying the decision tree algorithm to the updated training set to predict academic performance is implemented according to the following steps:
Step 5.1: define the updated training set as $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\ldots,x_i^{(n)})^T$ is the input feature vector. Each subcategory of learning behavior selected in step 4 is taken as one feature, so $x_i$ holds all the values of one learning behavior subcategory, $n$ is the number of values of that subcategory, and $y_i\in\{1,2,\ldots,K\}$ is the subcategory label;
Step 5.2: for any feature $A$ in the training set, the information gain ratio $g_R(D,A)$ of $A$ with respect to the training set $D$ is calculated according to the following formula:
where $g(D,A)$ is the information gain, calculated according to the following formula:
$$g(D,A)=H(D)-H(D|A) \tag{9}$$
The entropy of the training set is
$$H(D)=-\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|} \tag{7}$$
where $C_k$ is the $k$-th subcategory, $|C_k|$ is the number of samples belonging to the $k$-th subcategory $C_k$, and $|D|$ is the sample size of the training set;
The conditional entropy is
$$H(D|A)=\sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i)$$
where the subsets $D_i$ are obtained by dividing the training set $D$ into $n$ subsets $D_1,D_2,\ldots,D_n$ according to the values of feature $A$. Because $D_i$ is a subset split from the training set, $D_i=\{(x_j,y_j),(x_{j+1},y_{j+1}),\ldots,(x_N,y_N)\}$, where $j$ is the starting position of the $i$-th subset. The actual division of $D_i$ is made while the decision tree is generated, according to how each value of feature $A$ compares with the threshold $\varepsilon$; $n$ is the number of values of feature $A$ and $|D_i|$ is the number of samples in $D_i$. When calculating $H(D|A)$, $H(D_i)$ is calculated with formula (7) with $D$ replaced by $D_i$; in that case $C_k$ is the $k$-th subcategory in the dataset $D_i$ and $|C_k|$ is the number of samples belonging to it;
Step 5.3: calculate the information gain ratio of each subcategory feature according to step 5.2, sort the subcategory features by information gain ratio in descending order, and select the feature $A_g$ with the largest information gain ratio as the start node;
Step 5.4: if the information gain ratio of feature $A_g$ is less than the threshold $\varepsilon$ of its subcategory, the decision tree $T$ is a single-node tree; the subcategory $C_k$ with the most samples in the training set $D$ is taken as the class of feature $A_g$, and the decision tree $T$ is returned;
If the information gain ratio of feature $A_g$ is greater than or equal to the threshold $\varepsilon$ of its subcategory, the training set $D$ is divided into $n$ subsets $D_1,D_2,\ldots,D_n$ according to the values of feature $A_g$, the subcategory with the most samples in $D_i$ is taken as the class of the start node, and the child nodes of the start node are constructed. The start node and its child nodes form the decision tree $T$, which is returned;
Step 5.5: with $D_i$ as the training set, select the feature with the largest information gain ratio other than $A_g$ as the new node, treat it as $A_g$, and proceed as in step 5.4. Repeat until all features have been classified, then return the decision tree $T$. The prediction ends, and the pass or fail status of the academic performance is the final prediction result.
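The recursive construction in steps 5.2 to 5.5 resembles the C4.5 procedure, and a compact sketch may help. It rests on several assumptions not stated in the text: features are categorical, the gain ratio divides the information gain by the split information of the feature (the usual C4.5 denominator), and a single threshold `epsilon` of 0.01 applies to every feature. All names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy of a list of class labels or feature values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Information gain of `feature` divided by its split information."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond                         # g(D,A) = H(D) - H(D|A)
    split_info = entropy([row[feature] for row in rows])  # assumed denominator
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, features, epsilon=0.01):
    """Return a class label (leaf) or a dict describing an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: gain_ratio(rows, labels, f))
    if gain_ratio(rows, labels, best) < epsilon:
        return majority                                   # single-node tree
    node = {"feature": best, "default": majority, "children": {}}
    remaining = [f for f in features if f != best]
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining, epsilon)
    return node


def predict(tree, row):
    """Walk the tree; unseen feature values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


# Hypothetical discretized behaviors and pass/fail labels.
rows = [
    {"video": "high", "logins": "high"},
    {"video": "high", "logins": "low"},
    {"video": "low", "logins": "high"},
    {"video": "low", "logins": "low"},
]
labels = ["pass", "pass", "fail", "fail"]
tree = build_tree(rows, labels, ["video", "logins"])
print(predict(tree, {"video": "low", "logins": "high"}))
```

In this toy data the `video` feature separates the classes completely, so it becomes the start node and `logins` is never used.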
Step 6 is specifically:
Put the 20% of the static data and the 20% of the dynamic data that remain after the selection in step 4.1 into the test set as test data, process the test set data in the manner of steps 4.2 to 5.5, and generate the final decision tree prediction model to predict the academic performance result.
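The 80/20 division into training and test data can be sketched as below. The text does not say how the 80% is chosen, so the seeded shuffle is an assumption, as are the function name and the sample learner identifiers.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


learners = [f"learner_{i}" for i in range(10)]
train, test = split_train_test(learners)
print(len(train), len(test))
```

Splitting by learner, as here, keeps the static and dynamic data of one learner in the same part.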
Step 7 is specifically:
Step 7.1: determine the confusion matrix;
The confusion matrix contains the true class, the predicted class, and the result of comparing the two. The true class is the learner's actual pass or fail status on the adaptive platform, and the predicted class is the pass or fail status predicted with the test set and the decision tree. The comparison results are:
If the true class is pass and the predicted class is pass, the sample is correctly classified as positive; the number of samples correctly classified as positive is denoted TP;
If the true class is pass and the predicted class is fail, the sample is wrongly classified as negative; the number of samples wrongly classified as negative is denoted FN;
If the true class is fail and the predicted class is pass, the sample is wrongly classified as positive; the number of samples wrongly classified as positive is denoted FP;
If the true class is fail and the predicted class is fail, the sample is correctly classified as negative; the number of samples correctly classified as negative is denoted TN;
P denotes the total number of passes in the true class, N the total number of fails in the true class, P1 the total number of passes in the predicted class, and N1 the total number of fails in the predicted class;
Step 7.2: from the confusion matrix obtained in step 7.1, calculate the precision, accuracy and recall of the prediction, specifically:
Precision:
$$\mathrm{Precision}=\frac{TP}{TP+FP}$$
Accuracy:
$$\mathrm{Accuracy}=\frac{TP+TN}{P+N}$$
Recall:
$$\mathrm{Recall}=\frac{TP}{TP+FN}$$
When any one of the precision, accuracy and recall is not less than 90%, the learning behavior indicator is considered an indicator that affects the learner's performance.
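The evaluation in step 7 can be sketched as follows, using the standard definitions of precision, accuracy and recall in terms of TP, FN, FP and TN. The label strings and sample predictions are hypothetical, and "pass" is taken as the positive class as in step 7.1.

```python
def confusion_counts(actual, predicted, positive="pass"):
    """Count TP, FN, FP and TN by comparing actual and predicted labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Return (precision, accuracy, recall) from the four confusion counts."""
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("no samples to evaluate")
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, accuracy, recall


actual = ["pass", "pass", "pass", "fail", "fail"]
predicted = ["pass", "pass", "fail", "pass", "fail"]
precision, accuracy, recall = metrics(*confusion_counts(actual, predicted))

# Acceptance rule from the text: any one metric reaching 90% is enough.
accepted = max(precision, accuracy, recall) >= 0.9
print(precision, accuracy, recall, accepted)
```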
Step 8 is specifically:
Step 8.1: the number $k$ of learning behavior indicators determined in step 7 as affecting the learner's academic performance gives $k$ cluster groups. A center point is chosen for each cluster group, namely a value selected at random within that group, which serves as the cluster center. For the mean of the data in the dataset corresponding to each learning behavior indicator, and for each learner's data in that dataset, the distance to every cluster center is calculated with the Euclidean distance formula, where the dataset is the union of the training set and the test set. If the distance $u<0.5$, the learner is assigned to the cluster group of that center;
The Euclidean distance formula is:
$$u=\sqrt{\sum_{i}\left(h_i-l_i\right)^2}$$
where $h$ denotes the data in the dataset and $l$ the selected cluster center, with $h_i$ and $l_i$ their $i$-th components;
Step 8.2: following step 8.1, classify all learners according to the learning behavior indicators to obtain $k$ learner groups.
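For comparison, a textbook K-Means iteration with Euclidean distance looks like the sketch below. It departs from the text in two labeled ways: each learner is assigned to the nearest center rather than by the fixed threshold u < 0.5, and the initial centers are passed in explicitly instead of being drawn at random, so that the result is repeatable. The sample points are hypothetical indicator values scaled to the range 0 to 1.

```python
import math


def euclidean(h, l):
    """Euclidean distance between a data point h and a cluster center l."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, l)))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points]


def k_means(points, centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    centers = [list(c) for c in centers]
    groups = assign(points, centers)
    for _ in range(iterations):
        for c in range(len(centers)):
            members = [p for p, g in zip(points, groups) if g == c]
            if members:  # an empty group keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        new_groups = assign(points, centers)
        if new_groups == groups:
            break
        groups = new_groups
    return groups, centers


points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
groups, centers = k_means(points, centers=[(0.0, 0.0), (1.0, 1.0)])
print(groups)
```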
Step 9 is specifically:
For the $k$ learner groups obtained in step 8, learners are clustered into the corresponding learner group by the clustering method of step 8. Based on the learner group, the adaptive platform recommends relevant platform resources and suitable questions, and recommends the knowledge point currently being studied and the knowledge points to be studied next. The resources include videos, audio, pictures, documents, PPT slides and lesson plans.
The beneficial effects of the invention are:
(1) The invention makes effective use of the learner's learning behavior data on the adaptive platform and the learner's basic information data. The basic information data includes the place and type of household registration, family situation, parents' education level, whether the learner is a boarding student, and whether the learner is a left-behind child. The learning behavior data includes interface interaction data, content interaction data, test data and state change data. Factors that may affect academic performance are thus considered from several angles. Correlation analysis with the Pearson correlation coefficient screens out and marks the learning behaviors that may affect academic performance, which provides the basis for academic performance prediction. Based on the prediction results, the K-Means clustering algorithm yields the learner groups, and these are combined with each learner's learning style and learning preferences to provide a personalized intervention plan. From the prediction results learners can identify their weak points and poor learning behaviors, and they can also obtain a personalized intervention plan suited to their own learning style, which improves their learning efficiency and academic performance.
Brief Description of the Drawings
Figure 1 is a flowchart of the decision-tree-based method for academic performance prediction and personalized intervention of the invention;
Figure 2 is an implementation block diagram of the decision-tree-based method for academic performance prediction and personalized intervention of the invention.
Detailed Description of the Embodiments
The invention is described in detail below with reference to the drawings and specific embodiments.
The decision-tree-based method for academic performance prediction and personalized intervention of the invention, as shown in Figures 1 and 2, is implemented according to the following steps:
Step 1: collect the learner's learning behavior data, comprising static data and dynamic data. The static data comprises one major category: the learner's basic information data. The dynamic data comprises four major categories: the learner's interface interaction data, content interaction data, test data and state change data on the adaptive platform;
The basic information data comprises the following subcategories: place of household registration, type of household registration, family situation, parents' education level, whether the learner is a boarding student, and whether the learner is a left-behind child;
The learner's interface interaction data on the adaptive platform comprises the following subcategories: viewing test results, browsing materials, browsing announcements, browsing posts, browsing course information, browsing knowledge point mastery, and number of likes;
The learner's content interaction data on the adaptive platform comprises the following subcategories: number of courseware downloads, number of video downloads, number of video views, number of test paper downloads, number of posts, number of replies, number of online answers, number of assignment submissions, number of test comments posted, and number of comments received;
The learner's test data on the adaptive platform comprises the following subcategories: test score, test difficulty, knowledge point mastery, correct rate, answering time per question, test duration, and test type;
The learner's state change data on the adaptive platform comprises the following subcategories: number of logins, system login duration, cumulative online duration, interval between online sessions, duration away from the platform, number of times leaving the platform, login time, and offline time;
Step 2: process the learning behavior data collected in step 1, specifically:
Step 2.1, data cleaning: check the data collected in step 1 against the reasonable value range of each variable and the relationships between variables to determine whether the learner data meets the requirements. Data that is out of the normal range, logically unreasonable or self-contradictory is checked and corrected. Invalid values and missing values in a learner's behavior data are handled as follows;
Learning behavior data that does not match reality is regarded as an invalid value. For example, none of a learner's browsing durations on the platform can be negative; if such a value exists in the data, it is regarded as an invalid value and deleted;
If a learner's learning behavior data is incomplete, that is, it does not cover every category of learning behavior, the data that is present for that learner is treated as missing values and deleted. For example, if the learning behavior data of a learner contains only that learner's interface interaction data, such data is regarded as missing values and deleted;
Step 2.2: integrate the cleaned data. After cleaning, the processed data is integrated so that it is convenient to use in later data analysis;
Step 2.3: convert the integrated learning behavior data to string type, where converted floating-point values keep two decimal places. On the adaptive platform, all learner data is stored in the database in the types and form that the database requires. This only preserves the learner's data and serves to collect it. The invention performs specific data analysis on these data, and data in that form is not suitable for the later analysis, so the learner's learning behavior data must be type-converted: text types are converted to string type, integers are left unchanged, and floating-point values keep two decimal places;
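The cleaning and conversion rules of step 2 can be illustrated with a short sketch. It assumes each learner is a dictionary keyed by five hypothetical category names, that an invalid value causes the whole learner record to be dropped (the text only says the value is deleted), and that interface interaction values are numeric.

```python
# Hypothetical keys for the one static and four dynamic data categories.
REQUIRED_CATEGORIES = ("basic", "interface", "content", "test", "state")


def clean_records(records):
    """Drop incomplete or invalid learner records and round floats to 2 places."""
    cleaned = []
    for record in records:
        # Missing-value rule: every category must be present and non-empty.
        if any(not record.get(cat) for cat in REQUIRED_CATEGORIES):
            continue
        # Invalid-value rule: a browsing duration can never be negative.
        if any(v < 0 for v in record["interface"].values()):
            continue
        # Conversion rule: integers stay as they are, floats keep two decimals.
        cleaned.append({
            cat: {k: round(v, 2) if isinstance(v, float) else v
                  for k, v in record[cat].items()}
            for cat in REQUIRED_CATEGORIES
        })
    return cleaned


records = [
    {"basic": {"boarder": 1}, "interface": {"browse_seconds": 120.456},
     "content": {"posts": 3}, "test": {"score": 88.0}, "state": {"logins": 12}},
    {"basic": {"boarder": 0}, "interface": {"browse_seconds": -5.0},
     "content": {"posts": 1}, "test": {"score": 70.0}, "state": {"logins": 4}},
    {"interface": {"browse_seconds": 30.0}},  # only interface data: incomplete
]
cleaned = clean_records(records)
print(len(cleaned))
```

Only the first record survives: the second has a negative browsing duration and the third lacks four of the five categories.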
Step 3: quantify the dynamic data in the learning behavior data processed in step 2, specifically:
Step 3.1: quantification of the learner's interface interaction data on the adaptive platform:
The test-result view count and the like count are obtained directly on the adaptive platform by accumulating the learner's clicks on viewing test results and on likes;
Browsing of materials, announcements, posts, course information, and knowledge point mastery is measured by how long the learner browses the corresponding resource. The interval from the start to the end of one browsing session is the duration of that session, and the total browsing duration is the sum of the durations of all sessions, as follows:
The learner's browsing duration for a resource is calculated as:
$$S_{scan} = \sum_{i=1}^{t} \left( T_{leave}^{(i)} - T_{enter}^{(i)} \right)$$
where $S_{scan}$ is the learner's browsing duration for a resource, $T_{leave}^{(i)}$ and $T_{enter}^{(i)}$ are the times at which the learner leaves and enters the resource on the i-th visit, and t is the number of times the learner visits the resource;
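The browsing-duration sum above can be sketched in a few lines. The function name and the representation of a visit as an `(enter, leave)` pair of `datetime` values are assumptions for illustration only.

```python
from datetime import datetime
from typing import List, Tuple


def total_browse_seconds(visits: List[Tuple[datetime, datetime]]) -> float:
    """Sum (leave - enter) over all t visits to one resource, in seconds."""
    return sum((leave - enter).total_seconds() for enter, leave in visits)


visits = [
    (datetime(2021, 9, 18, 10, 0, 0), datetime(2021, 9, 18, 10, 1, 30)),  # 90 s
    (datetime(2021, 9, 18, 14, 0, 0), datetime(2021, 9, 18, 14, 0, 30)),  # 30 s
]
s_scan = total_browse_seconds(visits)
```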
Step 3.2: quantification of the learner's content interaction data on the adaptive platform:
From the learner's learning records on the adaptive platform, the corresponding behaviors are accumulated directly to obtain the number of courseware downloads, video downloads, video views, test paper downloads, posts, replies, online question-answering sessions, assignment submissions, test comments made, and comments received;
Step 3.3: quantification of the learner's test data on the adaptive platform:
The correct rate is the proportion of questions answered correctly in one stage test. Define the set of questions in a test paper as $Q = \{q_1, q_2, q_3, \ldots, q_m\}$ and the set of questions answered incorrectly as $E = \{e_1, e_2, e_3, \ldots, e_k\}$; the correct rate can then be expressed as:
$$T_{correct} = \frac{|Q| - |E|}{|Q|}$$
where $T_{correct}$ is the correct rate, and $|Q|$ and $|E|$ are the total number of questions and the number of questions answered incorrectly;
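As a small worked example of the correct rate, assuming questions are identified by string IDs (an illustrative choice, not something the source specifies):

```python
from typing import Set


def correct_rate(questions: Set[str], wrong: Set[str]) -> float:
    """Proportion of questions answered correctly: (|Q| - |E|) / |Q|."""
    if not questions:
        raise ValueError("the test paper has no questions")
    # Only count wrong answers that belong to this paper.
    return (len(questions) - len(wrong & questions)) / len(questions)


rate = correct_rate({"q1", "q2", "q3", "q4", "q5"}, {"q2"})
```

With five questions and one wrong answer, the correct rate is 4/5.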
The learner's degree of mastery of a knowledge point is determined as follows:
Under the knowledge point, x easy questions, x medium questions, and x hard questions are selected. The difficulty value of easy questions ranges over 0-0.3, of medium questions over 0.4-0.7, and of hard questions over 0.8-1.0. The degree of mastery W of the knowledge point is then calculated from the following quantities:
B, N, and H, the total difficulty values of the easy, medium, and hard questions respectively, and the sums of the difficulty values of the correctly answered easy questions, the correctly answered medium questions, and the correctly answered hard questions;
The test difficulty is obtained by averaging the difficulty of every question in a test paper. With the per-question difficulties of a paper defined as $F = \{f_1, f_2, f_3, \ldots, f_M\}$, the test difficulty $F_{difficult}$ is expressed in terms of the following quantities:
$\sum_{i=1}^{M} f_i$, the sum of the difficulty values of all questions in the paper; M, the total number of questions in the paper; and E, the learner's excellence index;
The excellence index E is in turn computed from a, the number of knowledge points the learner has studied, and $p_i$, the learner's degree of mastery of the i-th knowledge point, which is the mastery degree W given by formula (3);
The test score is obtained by summing the learner's score on each question of the test;
The test type is read directly from the adaptive platform. There are two types, the pre-learning test and the learning-status test; the pre-learning test is denoted by 0 and the learning-status test by 1;
The answering time of a question is the interval between the learner clicking into that question and clicking into the next one, and is read directly from the adaptive platform;
The test duration is the time spent on a whole test paper, expressed as $T = \sum_{i=1}^{M} T_i$, where $T_i$ is the answering time of the i-th question and M is the total number of questions in the paper;
Step 3.4: quantification of the learner's state change data on the adaptive platform:
The number of logins, system login duration, cumulative online duration, online time interval, duration away from the platform, number of departures from the platform, login time, and offline time are obtained by accumulating the learner's visits to the platform;
Step 4: select part of the static data processed in step 2 and part of the dynamic data quantified in step 3 as training set data, compute the correlations between the data in the training set, and use those correlations to determine which learning behavior variables serve as learning performance predictors, specifically:
Step 4.1: put 80% of the static data processed in step 2 and 80% of the dynamic data quantified in step 3 into the training set as training data;
Step 4.2: compute the Pearson correlation coefficient r of any two sets of training data in the training set. Training data that corresponds to static data takes the string value converted in step 2, and training data that corresponds to dynamic data takes the value quantified in step 3. The coefficient is calculated as:
$$r = \frac{\sum_{i=1}^{g} (Z_i - \bar{Z})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{g} (Z_i - \bar{Z})^2}\,\sqrt{\sum_{i=1}^{g} (Y_i - \bar{Y})^2}}$$
where g is the total number of training data in the training set, $Z_i$ is one training datum for a certain type of learning behavior, $\bar{Z}$ is the mean of all training data for that type of learning behavior, $Y_i$ is one training datum for a type of learning behavior different from that of $Z_i$, and $\bar{Y}$ is the mean of all training data for that other type of learning behavior;
Step 4.3: if the computed r > 0 or r < 0, the learning behaviors corresponding to $Z_i$ and $Y_i$ are both taken as learning behavior variables for the learning performance predictors;
(1) If r > 0, $Z_i$ and $Y_i$ in the data set and the training set are positively correlated, i.e. the larger the value of $Z_i$, the larger the value of $Y_i$. Variables in this relationship indicate that, in everyday learning, poor performance on one leads to a poor result on the other, so the learning behaviors corresponding to $Z_i$ and $Y_i$ need to be selected as indicators for learning performance prediction.
(2) If r < 0, $Z_i$ and $Y_i$ in the data set and the training set are negatively correlated, i.e. the larger the value of $Z_i$, the smaller the value of $Y_i$. Variables in this relationship indicate that, in everyday learning, outstanding performance on one is accompanied by a poor result on the other, so the learning behaviors corresponding to $Z_i$ and $Y_i$ need to be selected as indicators for learning performance prediction.
(3) If r = 0, $Z_i$ and $Y_i$ in the data set and the training set are not linearly correlated; the behavior variables $Z_i$ and $Y_i$ may be related in some other way, and these indicators can be ignored during learning performance prediction.
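The correlation screening of steps 4.2 and 4.3 can be sketched as below. The sketch assumes both behaviors have already been turned into numeric sequences of equal length; the tolerance `tol` used to decide that r is effectively zero, and the convention of returning 0 for a constant sequence, are assumptions added for floating-point safety and are not part of the source.

```python
import math
from typing import Sequence


def pearson_r(z: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric sequences."""
    g = len(z)
    if g != len(y) or g < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_z = sum(z) / g
    mean_y = sum(y) / g
    cov = sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y))
    spread_z = math.sqrt(sum((a - mean_z) ** 2 for a in z))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_z == 0 or spread_y == 0:
        return 0.0  # assumption: a constant variable carries no linear relation
    return cov / (spread_z * spread_y)


def keep_as_predictor(r: float, tol: float = 1e-9) -> bool:
    """Step 4.3: keep the pair when r > 0 or r < 0, ignore it when r = 0."""
    return abs(r) > tol


r_pos = pearson_r([1, 2, 3], [2, 4, 6])
r_neg = pearson_r([1, 2, 3], [6, 4, 2])
r_zero = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])
```

In practice an exact r = 0 is rare on real data, so a deployment would probably choose a larger tolerance or a significance test; that choice is outside what the source describes.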
Step 5: keep the training data in the training set that corresponds to the learning behaviors determined in step 4 as learning performance predictors, remove the other training data to obtain the updated training set, and apply the decision tree algorithm to the updated training set to predict learning performance;
Building a decision tree is usually a process of recursively selecting the optimal feature and splitting the training data on that feature so that each resulting subset is classified as well as possible. This process corresponds both to a partition of the feature space and to the construction of the tree. First, the root node is built and all training data is placed in it. An optimal feature is selected and the training data set is split into subsets on that feature, so that each subset has the best classification available under the current conditions. If a subset can already be classified essentially correctly, a leaf node is built and the subset is assigned to it; if a subset cannot yet be classified essentially correctly, a new optimal feature is selected for it, the subset is split further, and the corresponding nodes are built. This continues recursively until every subset of the training data is classified essentially correctly or no suitable feature remains. In the end every subset is assigned to a leaf node and therefore has a definite class.
Learning performance prediction with the decision tree algorithm and the updated training set is carried out in the following steps:
Step 5.1: define the updated training set as $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$ is the input feature vector. Each subcategory of the learning behaviors selected in step 4 is taken as one feature, so $x_i$ holds all values of one learning behavior subcategory, n is the number of those values, and $y_i \in \{1, 2, \ldots, K\}$ is the subcategory label;
Step 5.2: compute the information gain ratio $g_r(D, A)$ of each feature A in the training set with respect to the training set D. The gain ratio is based on the information gain g(D, A), which is calculated as:
g(D,A)=H(D)-H(D|A) (9)
The entropy H(D) of the training set is
$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$
where $C_k$ is the k-th subcategory, $|C_k|$ is the number of samples belonging to $C_k$, and $|D|$ is the sample size of the training set;
The conditional entropy H(D|A) is
$$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$
where the subsets $D_i$ are obtained by splitting the training set D into n subsets $D_1, D_2, \ldots, D_n$ according to the values of feature A. Because $D_i$ is a subset of the training set, $D_i = \{(x_j, y_j), (x_{j+1}, y_{j+1}), \ldots, (x_N, y_N)\}$, where j marks the position at which the i-th subset starts. The concrete split is made while the decision tree is generated, by comparing each value of feature A with the threshold ε; n is the number of values of feature A and $|D_i|$ is the number of samples in $D_i$. When computing H(D|A), $H(D_i)$ is computed with formula (7), the entropy formula for H(D), with D replaced by $D_i$; in that case $C_k$ is the k-th subcategory within $D_i$ and $|C_k|$ is the number of samples belonging to it;
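The entropy, conditional entropy, and information gain of step 5.2 can be sketched as follows. The gain-ratio denominator used here is the split information $H_A(D)$ from the usual C4.5 definition; that is an assumption, since the text above does not spell out the denominator. Function names and the categorical toy data are illustrative.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """H(D) = -sum_k |C_k|/|D| * log2(|C_k|/|D|)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def conditional_entropy(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """H(D|A) = sum_i |D_i|/|D| * H(D_i), one subset D_i per value of A."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(subset) / total * entropy(subset) for subset in groups.values())


def info_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """g(D, A) = H(D) - H(D|A), formula (9)."""
    return entropy(labels) - conditional_entropy(feature, labels)


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """g(D, A) divided by the split information H_A(D) (assumed C4.5 form)."""
    split_info = entropy(feature)  # entropy of the partition induced by A
    if split_info == 0:
        return 0.0  # A takes a single value, so it cannot split D
    return info_gain(feature, labels) / split_info


labels = ["pass", "pass", "fail", "fail"]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]
```

Here `informative` separates the two classes perfectly, while `uninformative` leaves each subset evenly mixed and so yields zero gain.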
Step 5.3: compute the information gain ratio of every subcategory feature as in step 5.2, sort the features by gain ratio from largest to smallest, and select the feature $A_g$ with the largest gain ratio as the start node;
Step 5.4: if the information gain ratio of feature $A_g$ is smaller than the threshold ε of its subcategory, the decision tree T is made a single-node tree, the subcategory $C_k$ with the most samples in the training set D is taken as the class of $A_g$, and T is returned;
If the information gain ratio of feature $A_g$ is greater than or equal to the threshold ε of its subcategory, the training set D is split into n subsets $D_1, D_2, \ldots, D_n$ according to the values of $A_g$, the subcategory with the most samples in $D_i$ is taken as the class of the start node, and the child nodes of the start node are built; the start node and its child nodes form the decision tree T, and T is returned;
Step 5.5: with $D_i$ as the training set, select the feature with the largest information gain ratio other than $A_g$ as the new splitting feature, treat it as $A_g$, and proceed as in step 5.4. This loop continues until all features have been classified, the decision tree T is returned, and prediction ends, with pass or fail of the learning performance as the final prediction result;
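The recursion in steps 5.3 to 5.5 can be sketched as a small gain-ratio tree over categorical features. This is an illustration under stated assumptions: the threshold value `epsilon=0.01` is hypothetical (the source gives no value), a single ε is used for all features, the gain ratio uses the usual C4.5 split-information denominator, and the feature names in the toy data are invented.

```python
import math
from collections import Counter


def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, labels, feature):
    column = [row[feature] for row in rows]
    cond = 0.0
    for value in set(column):
        subset = [y for v, y in zip(column, labels) if v == value]
        cond += len(subset) / len(labels) * entropy(subset)
    split_info = entropy(column)  # assumed C4.5 denominator
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, features, epsilon=0.01):
    """Return a class label (leaf) or a dict node keyed by feature value."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: gain_ratio(rows, labels, f))
    if gain_ratio(rows, labels, best) < epsilon:
        return majority  # step 5.4: below the threshold, single-node tree
    node = {"feature": best, "default": majority, "children": {}}
    remaining = [f for f in features if f != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining, epsilon
        )
    return node


def predict(tree, row):
    """Walk the tree; unseen feature values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


rows = [
    {"logins": "high", "posts": "many"},
    {"logins": "high", "posts": "few"},
    {"logins": "low", "posts": "many"},
    {"logins": "low", "posts": "few"},
]
labels = ["pass", "pass", "fail", "fail"]
tree = build_tree(rows, labels, ["logins", "posts"])
```

On this toy data the tree splits once on `logins`, because `posts` carries no information about the label.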
Step 6: following steps 4-5, select part of the static data processed in step 2 and part of the dynamic data quantified in step 3 as test set data, and generate the final decision tree prediction model to predict the learning performance result, specifically:
The 20% of the static data and the 20% of the dynamic data that remain after the selection in step 4.1 are put into the test set as test data; the test set data is processed in the manner of step 4.2 to step 5.6, and the final decision tree prediction model is generated to predict the learning performance result;
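The 80%/20% partition of steps 4.1 and 6 can be sketched as below. The source does not say how the 80% is chosen; shuffling with a fixed seed is an assumption made here so that the split is reproducible.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_20(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Return (training set, test set) holding 80% and 20% of the samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: random selection
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


train, test = split_80_20(range(10))
```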
Step 7: evaluate the prediction result by precision, accuracy, and recall. When any one of precision, accuracy, and recall is not less than 90%, the corresponding learning behavior indicator can be taken as a learning behavior indicator that affects the learner's performance, specifically:
Step 7.1: determine the confusion matrix;
The confusion matrix contains the true class, the predicted class, and the result of comparing the two. The true class is whether the learner actually passed on the adaptive platform; the predicted class is whether the learner is predicted to pass by the decision tree on the test set. The comparison results are shown in Table 1:
Table 1
| True class | Predicted: pass | Predicted: fail | Total |
|---|---|---|---|
| Pass | TP | FN | P |
| Fail | FP | TN | N |
| Total | P1 | N1 | P + N |
If the true class is pass and the predicted class is pass, the sample is correctly classified as positive; the number of such samples is TP (True Positives);
If the true class is pass and the predicted class is fail, the sample is wrongly classified as negative; the number of such samples is FN (False Negatives);
If the true class is fail and the predicted class is pass, the sample is wrongly classified as positive; the number of such samples is FP (False Positives);
If the true class is fail and the predicted class is fail, the sample is correctly classified as negative; the number of such samples is TN (True Negatives);
P is the total number of passes in the true class and N is the total number of fails in the true class; P1 is the total number of passes in the predicted class and N1 is the total number of fails in the predicted class;
Step 7.2: from the confusion matrix obtained in step 7.1, compute the precision, accuracy, and recall of the prediction, specifically:
Precision:
$$Precision = \frac{TP}{TP + FP} = \frac{TP}{P_1}$$
Accuracy:
$$Accuracy = \frac{TP + TN}{P + N}$$
Recall:
$$Recall = \frac{TP}{TP + FN} = \frac{TP}{P}$$
When any one of precision, accuracy, and recall is not less than 90%, the learning behavior indicator is considered usable as an indicator that affects the learner's performance;
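The evaluation in step 7 can be sketched as follows, using the standard definitions of precision, accuracy, and recall with "pass" as the positive class. The function names and the label strings are illustrative; the 90% threshold and the "any one of the three" rule are taken from the text.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[str], predicted: Sequence[str],
                     positive: str = "pass") -> Tuple[int, int, int, int]:
    """Return (TP, FN, FP, TN) with `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float, float]:
    """Return (precision, accuracy, recall); empty denominators give 0.0."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, accuracy, recall


def indicator_accepted(precision: float, accuracy: float, recall: float,
                       threshold: float = 0.9) -> bool:
    """Step 7: accept when any one of the three metrics reaches the threshold."""
    return max(precision, accuracy, recall) >= threshold


actual = ["pass", "pass", "pass", "fail", "fail"]
predicted = ["pass", "pass", "fail", "pass", "fail"]
counts = confusion_counts(actual, predicted)
precision, accuracy, recall = metrics(*counts)
```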
Step 8: perform K-Means clustering on the behavior indicators determined in step 7 to identify the learner groups, specifically:
Step 8.1: let k be the number of learning behavior indicators determined in step 7 to affect learners' performance, which gives k clusters. A center point is chosen for each cluster, i.e. a value picked at random within each group, and this point is the cluster center. For each type of learning behavior indicator, the Euclidean distance formula is used to compute the distance to every cluster center from the mean of the data for that indicator in the data set and from each learner's data for that indicator; the data set is the union of the training set and the test set. If the distance u < 0.5, the learner is assigned to the cluster of that center;
The Euclidean distance formula is:
$$u = \sqrt{\sum_{i} (h_i - l_i)^2}$$
where h is a data point in the data set, l is the selected cluster center, and the sum runs over the components of h and l;
Step 8.2: following step 8.1, classify all learners by learning behavior indicator to obtain k learner groups;
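The distance-threshold assignment of step 8.1 can be sketched as below. It implements only what the text describes, namely assigning a learner to every cluster whose center lies within distance 0.5; it does not perform the iterative center updates of standard K-Means. The assumption that indicator values are scaled to a comparable range (here 0 to 1), and all names in the example, are illustrative.

```python
import math
from typing import Dict, List, Sequence


def euclidean(h: Sequence[float], l: Sequence[float]) -> float:
    """Euclidean distance between a learner's data h and a cluster center l."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, l)))


def assign_groups(learners: Dict[str, Sequence[float]],
                  centers: Dict[str, Sequence[float]],
                  max_distance: float = 0.5) -> Dict[str, List[str]]:
    """Put each learner into every cluster whose center is closer than max_distance."""
    groups: Dict[str, List[str]] = {name: [] for name in centers}
    for learner_id, vector in learners.items():
        for name, center in centers.items():
            if euclidean(vector, center) < max_distance:
                groups[name].append(learner_id)
    return groups


learners = {"s1": (0.1, 0.2), "s2": (0.9, 0.9)}
centers = {"low_activity": (0.0, 0.0), "high_activity": (1.0, 1.0)}
groups = assign_groups(learners, centers)
```

With a fixed threshold, a learner can end up in several groups or in none; handling those cases is a design decision the source leaves open.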
Step 9: for the different learner groups determined in step 8, each sharing the same learning behavior, provide a different learning plan for each group, specifically:
Given the k learner groups obtained in step 8, learners are clustered into the corresponding group by the clustering method of step 8. Based on the learner's group, the adaptive platform recommends relevant platform resources and suitable questions, and recommends the knowledge points currently being studied and those to be studied next; the resources include videos, audio, pictures, documents, PPT slides, and lesson plans.
After the learning behavior data is obtained from the adaptive platform, the present invention first applies preprocessing operations such as cleaning, conversion, and integration so that the data set is more regular. Second, behavior indicators that may affect learning performance are extracted from the learner's static and dynamic data as predictors. Finally, the predictors are combined with the decision tree prediction algorithm to predict the behavior indicators that may affect learning performance, and the prediction algorithm is evaluated by accuracy, precision, and recall. Once the final behavior indicators are determined, the K-Means clustering algorithm identifies the learner groups that share each behavior characteristic, and each group receives a personalized intervention plan based on its behavior, so that potential problems are prevented in advance and learners' efficiency improves.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111101341.4A | 2021-09-18 | 2021-09-18 | A method of learning achievement prediction and individualized intervention based on decision tree |
| Publication Number | Publication Date |
|---|---|
| CN113869569A | 2021-12-31 |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20211231 |