
AI for Teachers, An Open Textbook: Edition 1

Similarities in life and distances in data


Ethical guidelines on the use of artificial intelligence and data in teaching and learning for educators, European Commission, October 2022:
AI-supported collaborative learning:
Data on each learner’s work style and past performance is used to divide them into groups
with the same ability levels or suitable mix of abilities and talents. AI systems provide
inputs/suggestions on how a group is working together by monitoring the level of interaction
between group members.
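
The two grouping strategies mentioned in the guidelines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the learner names, the scores, and the helper functions `homogeneous_groups` and `mixed_groups` are hypothetical, and a single past-performance score stands in for the richer data a real system would use.

```python
def homogeneous_groups(scores, group_size):
    """Put learners with similar scores in the same group."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


def mixed_groups(scores, n_groups):
    """Deal ranked learners out back and forth so each group gets a mix of levels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    groups = [[] for _ in range(n_groups)]
    for position, learner in enumerate(ranked):
        round_number, slot = divmod(position, n_groups)
        # Reverse direction on every other round ("snake" order).
        index = slot if round_number % 2 == 0 else n_groups - 1 - slot
        groups[index].append(learner)
    return groups


# Hypothetical past-performance scores (out of 100).
scores = {"Ana": 92, "Ben": 85, "Cleo": 78, "Dev": 70, "Eli": 64, "Fay": 55}

same_level = homogeneous_groups(scores, group_size=2)
mixed = mixed_groups(scores, n_groups=3)
print(same_level)
print(mixed)
```

In the first result each pair has neighbouring scores; in the second, each pair combines a higher and a lower score.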

Similarities in life, distances in data

MIT : "The most common
type of unsupervised learning is cluster analysis, where the
algorithm looks for clusters of instances that are more similar
to each other than they are to other instances in the
data. These clustering algorithms often begin by guessing
a set of clusters and then iteratively updating the clusters
(dropping instances from one cluster and adding them to
another) so as to increase both the within-cluster similarity
and the diversity across clusters.
A challenge for clustering is figuring out how to measure
similarity. If all the attributes in a data set are numeric
and have similar ranges, then it probably makes sense just
to calculate the Euclidean distance (better known as the
straight-line distance) between the instances (or rows).
Rows that are close together in the Euclidean space are
then treated as similar. A number of factors, however, can
make the calculation of similarity between rows complex.
In some data sets, different numeric attributes have different
ranges, with the result that a variation in row values
in one attribute may not be as significant as a variation of
a similar magnitude in another attribute. In these cases,
the attributes should be normalized so that they all have
the same range. Another complicating factor in calculating
similarity is that things can be deemed similar in many
different ways. Some attributes are sometimes more important
than other attributes, so it might make sense to weight some attributes in the distance calculations, or it
may be that the data set includes nonnumeric data. These
more complex scenarios may require the design of bespoke
similarity metrics for the clustering algorithm to use." ...
"An unsupervised clustering algorithm will look for
groups of rows that are more similar to each other than
they are to the other rows in the data. Each of these groups
of similar rows defines a cluster of similar instances. For
instance, an algorithm can identify causes of a disease or
disease comorbidities (diseases that occur together) by
looking for attribute values that are relatively frequent
within a cluster. The simple idea of looking for clusters of
similar rows is very powerful and has applications across
many areas of life. Another application of clustering rows
is making product recommendations to customers. If a
customer liked a book, song, or movie, then he may enjoy
another book, song, or movie from the same cluster."
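
The points about distance, normalization and weighting in the passage above can be made concrete with a small sketch. Everything in it is an assumption chosen for illustration: the rows are invented (test score, absences) pairs, min-max scaling is one of several possible ways to normalize, and multiplying each squared difference by a weight is one common way to weight attributes.

```python
import math


def euclidean(a, b, weights=None):
    """Straight-line distance between two rows, optionally weighting each attribute."""
    weights = weights or [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def min_max_normalize(rows):
    """Rescale every column to the range 0..1 so no attribute dominates by scale alone."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical rows: (test score out of 100, absences in a term).
rows = [[90, 2], [70, 3], [88, 12], [50, 0]]

# On the raw numbers the score column dominates: student 0 looks closest to student 2.
raw_01 = euclidean(rows[0], rows[1])
raw_02 = euclidean(rows[0], rows[2])

# After normalization the large gap in absences counts too: student 0 is closer to student 1.
scaled = min_max_normalize(rows)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

# Giving absences a quarter of the weight of scores changes the picture again.
weighted_01 = euclidean(scaled[0], scaled[1], weights=[1.0, 0.25])
weighted_02 = euclidean(scaled[0], scaled[2], weights=[1.0, 0.25])
```

Which student counts as "most similar" changes with each choice, which is why the design of the similarity measure deserves attention.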

MIT: "The standard data science approach to this type of
analysis is to frame the problem as a clustering task. Clustering
involves sorting the instances in a data set into subgroups containing similar instances. Usually clustering
requires an analyst to first decide on the number of subgroups
she would like identified in the data. This decision
may be based on domain knowledge or informed by project
goals. A clustering algorithm is then run on the data
with the desired number of subgroups input as one of the
algorithm's parameters. The algorithm then creates that
number of subgroups by grouping instances based on the
similarity of their attribute values. Once the algorithm has
created the clusters, a human domain expert reviews the
clusters to interpret whether they are meaningful. In the
context of designing a marketing campaign, this review
involves checking whether the groups reflect sensible customer
personas or identifies new personas not previously
considered. ... As is true of all data science projects, one of the biggest
challenges with clustering is to decide which attributes to
include and which to exclude so as to get the best results. ... One of the advantages of clustering as an analytics
approach is that it can be applied to most types of data.
Because of its versatility, clustering is often used as a data-exploration
tool during the data-understanding stage of
many data science projects. Also, clustering is also useful
across a wide range of domains. For example, it has been
used to analyze students in a given course in order to identify
groups of students who need extra support or prefer
different learning approaches. It has also been used to
identify groups of similar documents in a corpus, and in
science it has been used in bio-informatics to analyze gene
sequences in microarray analysis."
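
The workflow described above (choose the number of subgroups, run the algorithm, then review the result) can be sketched with a very small version of k-means, one widely used clustering algorithm. The quoted text does not name a specific algorithm, so this choice is an assumption, as are the student data and the simplification of using the first `k` rows as the initial guess.

```python
import math


def k_means(rows, k, iterations=10):
    """Minimal k-means sketch: returns (clusters, centers). Assumes k <= len(rows)."""
    # Initial guess: the first k rows serve as cluster centers.
    centers = [list(row) for row in rows[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 1: assign every row to its nearest center.
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda i: math.dist(row, centers[i]))
            clusters[nearest].append(row)
        # Step 2: move each center to the mean of the rows assigned to it.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centers


# Hypothetical students: (quiz average, homework completion), both already on a 0..1 scale.
students = [(0.9, 0.95), (0.3, 0.4), (0.85, 0.9), (0.35, 0.3), (0.8, 0.85), (0.25, 0.35)]

# The analyst decides the number of subgroups up front.
clusters, centers = k_means(students, k=2)
for members in clusters:
    print(members)
```

The algorithm only returns groups of similar rows. Deciding whether a group means "needs extra support" or something else remains the job of the teacher or domain expert.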

"How do we operationalize diversity in a selection task? If we had a distance
function between pairs of candidates, we could measure the average distance
between selected candidates. As a strawman, let’s say we use the Euclidean
distance based on the GPA and interview score. If we incorporated such a diversity
criterion into the objective function, it would result in a model where the GPA is
weighted less. This technique has the advantage of being blind: we didn’t explicitly
consider the group membership, but as a side-effect of insisting on diversity of
the other observable attributes, we have also improved demographic diversity.
However, a careless application of such an intervention can easily go wrong: for
example, the model might give weight to attributes that are completely irrelevant
to the task.
More generally, there are many possible algorithmic interventions beyond
picking different thresholds for different groups. In particular, the idea of a
similarity function between pairs of individuals is a powerful one, and we’ll see
other interventions that make use of it. But coming up with a suitable similarity
function in practice isn’t easy: it may not be clear which attributes are relevant,
how to weight them, and how to deal with correlations between attributes" FML
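
The "average distance between selected candidates" mentioned at the start of this passage can be computed as follows. This is a sketch of the strawman only: the candidate values are invented, both attributes are assumed to share a 0 to 4 scale, and the function name `average_pairwise_distance` is hypothetical.

```python
import math
from itertools import combinations


def average_pairwise_distance(candidates):
    """Mean Euclidean distance over all pairs; needs at least two candidates."""
    pairs = list(combinations(candidates, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical candidates: (GPA, interview score), both on a 0..4 scale.
similar_selection = [(3.9, 3.0), (3.8, 3.1), (3.9, 3.2)]
varied_selection = [(3.9, 2.0), (3.0, 3.9), (3.5, 3.0)]

low_diversity = average_pairwise_distance(similar_selection)
high_diversity = average_pairwise_distance(varied_selection)
print(low_diversity, high_diversity)
```

A selection procedure that rewards a higher value of this measure will favour the second, more varied group, with the caveats about irrelevant attributes that the passage raises.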

This page is referenced by:

AI Speak: Data based systems Part 1 (2023-01-04)
Decisions in the classroom
As a teacher, you have access to many kinds of data: tangible data such as attendance and performance records, and intangible data such as student body language. Consider some of the decisions you take in your professional life: What data helps you make these decisions?

There are technological applications that can help you visualize or process data. Artificial intelligence systems use data to personalise learning and to make predictions and decisions that might help you teach and manage the classroom: Do you have needs that technology can answer? If yes, what data would such a system require to carry out the task?
Educational systems have always generated data: students' personal data, academic records, attendance data and more. With digitalisation and AIED applications, more data is recorded and stored: mouse clicks, opened pages, timestamps and keystrokes.¹ With data-centric thinking becoming the norm in society, it is natural to ask how all this data can be put to pertinent use: Could we give more personalised feedback to the learner? Could we design better visualisation and notification tools for the teacher?²

Whatever the technology used, it has to meet a real requirement in the classroom. After the need is identified, we can look at the data available and ask what is relevant to the desired outcome. This involves uncovering the factors that let educators make nuanced decisions. Can these factors be captured using available data? Are data and data-based systems the best way of addressing the need? What could be the unintended consequences of using data this way?³

Machine learning lets us defer many of these questions to the data itself.⁴ ML applications are trained on data. They work by operating on data. They find patterns and make generalisations and store these as models - data that can be used to answer future questions.⁴ Their decisions and predictions, and how these affect student learning, are all data too. Thus, knowing how programmers, the machine and the user handle data is an important part of understanding how artificial intelligence works.
About Data
Data is generally about a real-world entity: a person, an object, or an event. Each entity can be described by a number of attributes (also called features or variables).⁵ For example, name, age and class are some attributes of a student. The set of these attributes is the data we have on the student, which, while not in any way close to the real entity, does tell us something about them. Data collected, used and processed in the educational system is called educational data.¹

A dataset is the data on a collection of entities, arranged in rows and columns. The attendance record of a class is a dataset. Each row is the record of one student. Each column records presence or absence on a particular day or session, so each column is an attribute.

Data is created by choosing attributes and measuring them: every piece of data is the result of human decisions and choices. Thus, data creation is a subjective, partial and messy process, prone to technical difficulties.⁴,⁵ Further, what we choose to measure, and what we don't, can have a big influence on expected outcomes.
Data traces are records of student activity such as mouse clicks, data on opened pages, the timing of interactions or key presses in a digital system.¹ Metadata is data that describes other data.⁵ Derived data is data calculated or inferred from other data: the individual scores of each student are data; the class average is derived data. Derived data is often more useful for gaining insights, finding patterns and making predictions. Machine learning applications can create derived data and link it with metadata and data traces to create detailed learner models, which help in personalising learning.¹
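
As a small illustration of derived data, the class average and each student's difference from it can be calculated from individual scores. The names and marks below are invented for the example.

```python
from statistics import mean

# Data: hypothetical individual scores (marks out of 20).
scores = {"Ana": 14, "Ben": 9, "Cleo": 16}

# Derived data: values calculated from the scores rather than measured directly.
class_average = mean(scores.values())
difference_from_average = {name: score - class_average for name, score in scores.items()}

print(class_average)
print(difference_from_average)
```

Neither the average nor the differences were collected from students directly, yet they are often what a teacher looks at first.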

For any data-based application to be successful, attributes should be carefully chosen and correctly measured. The patterns discovered in them should be checked to see if they make sense in the educational context. When designed and maintained correctly, data-driven systems can be very valuable.

This chapter aims to introduce a few basics of data and data-based technology. Data literacy, however, is a very important skill that merits dedicated training, continuing support and regular updating.¹

Legislation you should know about
Because of the drastic drop in the cost of data storage, more data and metadata are saved and retained for longer.⁶ This can lead to privacy breaches and rights violations. Laws like the General Data Protection Regulation (GDPR) discourage such practices and give EU citizens more control over their personal data. They provide legally enforceable data protection rules across all EU member states.

According to GDPR, personal data is any information relating to an identified or identifiable person (data subject). Schools, in addition to engaging with companies that handle their data, store huge amounts of personal information about students, parents, staff, management, and suppliers. As data controllers, they are required to store data which they process confidentially and securely and have procedures in place for the protection and proper use of all personal data.¹

Rights established by the GDPR include :
- The Right to Access, which entitles citizens to find out easily what data is being collected about them
- The Right to Be Informed of the use made of their data
- The Right to Erasure, which allows a citizen whose data has been collected by a platform to ask for that data to be removed from the dataset built by the platform (and which may be sold to others)
- The Right to Explanation, under which citizens should be given an explanation whenever they need clarification on automated decision processes that affect them
However, the GDPR does allow the collection of some data under “legitimate interest”⁷ and the indefinite use of derived, aggregated, or anonymized data without consent.⁵ The new Digital Services Act restricts the use of personal data for targeted advertising purposes.⁷ In addition, the EU-US Privacy Shield strengthens the data-protection rights of EU citizens whose data have been moved outside of the EU.⁵

Please refer to GDPR for dummies for the analysis done by independent experts from the Civil Liberties Union for Europe (Liberties), which is a watchdog that safeguards the human rights of everyone in the European Union.

------------------------------------------------------------------------------------------------------
¹Ethical guidelines on the use of artificial intelligence and data in teaching and learning for educators, European Commission, October 2022
²du Boulay, B., Poulovassilis, A., Holmes, W., Mavrikis, M., Artificial Intelligence And Big Data Technologies To Close The Achievement Gap, in Luckin, R., ed. Enhancing Learning and Teaching with Technology, London: UCL Institute of Education Press, pp. 256–285, 2018
³Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O., Barnes, P., Mitchell, M., Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, New York, 2021
⁴Barocas, S., Hardt, M., Narayanan, A., Fairness and machine learning: Limitations and Opportunities, yet to be published
⁵Kelleher, J.D., Tierney, B., Data Science, London, 2018
⁶Schneier, B., Data and Goliath: The Hidden Battles to Capture Your Data and Control Your World, W. W. Norton & Company, 2015
⁷Kant, T., Identity, Advertising, and Algorithmic Targeting: Or How (Not) to Target Your “Ideal User.”, MIT Case Studies in Social and Ethical Responsibilities of Computing, 2021
