AI for Teachers, An Open Textbook: Edition 1

Similarities in life and distances in data

 


Ethical guidelines on the use of artificial intelligence and data in teaching and learning for educators, European Commission, October 2022:
AI-supported collaborative learning:
Data on each learner’s work style and past performance is used to divide them into groups
with the same ability levels or suitable mix of abilities and talents. AI systems provide
inputs/suggestions on how a group is working together by monitoring the level of interaction
between group members.
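
The two grouping strategies mentioned above (same ability levels, or a mix of abilities) can be illustrated with a minimal sketch. It assumes each learner is described by a single past-performance score; the names, scores, and function names are hypothetical, and a real system would also draw on work style and other data.

```python
def same_ability_groups(learners, size):
    """Rank learners by score, then cut the ranking into consecutive groups."""
    ranked = sorted(learners, key=lambda learner: learner[1], reverse=True)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def mixed_ability_groups(learners, n_groups):
    """Rank learners by score, then deal them out to the groups in turn."""
    ranked = sorted(learners, key=lambda learner: learner[1], reverse=True)
    groups = [[] for _ in range(n_groups)]
    for position, learner in enumerate(ranked):
        groups[position % n_groups].append(learner)
    return groups


# Hypothetical learners: (name, past-performance score out of 100)
learners = [("Ana", 92), ("Ben", 55), ("Cleo", 78),
            ("Dev", 61), ("Eli", 88), ("Fay", 47)]

similar = same_ability_groups(learners, size=3)
mixed = mixed_ability_groups(learners, n_groups=2)
```

In this sketch `similar` keeps the three highest-scoring learners together, while `mixed` spreads high and low scores across both groups.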

Similarities in life, distances in data

MIT: "The most common
type of unsupervised learning is cluster analysis, where the
algorithm looks for clusters of instances that are more similar
to each other than they are to other instances in the
data. These clustering algorithms often begin by guessing
a set of clusters and then iteratively updating the clusters
(dropping instances from one cluster and adding them to
another) so as to increase both the within-cluster similarity
and the diversity across clusters.
A challenge for clustering is figuring out how to measure
similarity. If all the attributes in a data set are numeric
and have similar ranges, then it probably makes sense just
to calculate the Euclidean distance (better known as the
straight-line distance) between the instances (or rows).
Rows that are close together in the Euclidean space are
then treated as similar. A number of factors, however, can
make the calculation of similarity between rows complex.
In some data sets, different numeric attributes have different
ranges, with the result that a variation in row values
in one attribute may not be as significant as a variation of
a similar magnitude in another attribute. In these cases,
the attributes should be normalized so that they all have
the same range. Another complicating factor in calculating
similarity is that things can be deemed similar in many
different ways. Some attributes are sometimes more important
than other attributes, so it might make sense to weight some attributes in the distance calculations, or it
may be that the data set includes nonnumeric data. These
more complex scenarios may require the design of bespoke
similarity metrics for the clustering algorithm to use." ...
"An unsupervised clustering algorithm will look for
groups of rows that are more similar to each other than
they are to the other rows in the data. Each of these groups
of similar rows defines a cluster of similar instances. For
instance, an algorithm can identify causes of a disease or
disease comorbidities (diseases that occur together) by
looking for attribute values that are relatively frequent
within a cluster. The simple idea of looking for clusters of
similar rows is very powerful and has applications across
many areas of life. Another application of clustering rows
is making product recommendations to customers. If a
customer liked a book, song, or movie, then he may enjoy
another book, song, or movie from the same cluster."
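
The passage above names three ingredients: the Euclidean distance, normalization of attributes with different ranges, and optional attribute weights. The following is a minimal sketch of each, assuming min-max rescaling to the range 0 to 1 (one common way to normalize, not necessarily the one the quoted authors have in mind). The learner data and function names are hypothetical.

```python
import math


def min_max_normalize(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(value - lo) / (hi - lo) if hi > lo else 0.0
         for value, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def euclidean(a, b, weights=None):
    """Straight-line distance between two rows, with optional attribute weights."""
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical learners: (hours of study per week, test score out of 100)
learners = [[2, 40], [3, 90], [10, 42]]

# On the raw data the test score dominates, simply because its range is larger.
raw_01 = euclidean(learners[0], learners[1])
raw_02 = euclidean(learners[0], learners[2])

# After normalization both attributes contribute on the same scale.
scaled = min_max_normalize(learners)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
```

On the raw values the first learner looks far closer to the third than to the second; after rescaling, the two distances are nearly equal. Passing `weights` to `euclidean` lets some attributes count for more than others.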

MIT: "The standard data science approach to this type of
analysis is to frame the problem as a clustering task. Clustering
involves sorting the instances in a data set into subgroups containing similar instances. Usually clustering
requires an analyst to first decide on the number of subgroups
she would like identified in the data. This decision
may be based on domain knowledge or informed by project
goals. A clustering algorithm is then run on the data
with the desired number of subgroups input as one of the
algorithm's parameters. The algorithm then creates that
number of subgroups by grouping instances based on the
similarity of their attribute values. Once the algorithm has
created the clusters, a human domain expert reviews the
clusters to interpret whether they are meaningful. In the
context of designing a marketing campaign, this review
involves checking whether the groups reflect sensible customer
personas or identifies new personas not previously
considered. ... As is true of all data science projects, one of the biggest
challenges with clustering is to decide which attributes to
include and which to exclude so as to get the best results. ... One of the advantages of clustering as an analytics
approach is that it can be applied to most types of data.
Because of its versatility, clustering is often used as a data-exploration
tool during the data-understanding stage of
many data science projects. Also, clustering is also useful
across a wide range of domains. For example, it has been
used to analyze students in a given course in order to identify
groups of students who need extra support or prefer
different learning approaches. It has also been used to
identify groups of similar documents in a corpus, and in
science it has been used in bio-informatics to analyze gene
sequences in microarray analysis."
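
The workflow described above (choose the number of subgroups, then let the algorithm group instances by similarity) can be sketched with k-means, one widely used clustering algorithm; the quoted text does not name a specific one. This is a simplified illustration: it starts from the first k rows as a fixed initial guess, whereas practical implementations usually start from random centres. The student data is hypothetical.

```python
import math


def k_means(rows, k, iterations=20):
    """Group `rows` into `k` clusters; returns one cluster index per row."""
    # Initial guess: the first k rows serve as cluster centres
    # (a simplifying assumption made for this sketch).
    centres = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(row, centres[c]))
            for row in rows
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already-normalized data: (quiz average, homework completion)
students = [[0.9, 0.8], [0.2, 0.3], [0.85, 0.9], [0.1, 0.2], [0.3, 0.25]]

# The analyst decides on two subgroups before running the algorithm.
groups = k_means(students, k=2)
```

The algorithm only returns cluster indices. Deciding whether a cluster means "needs extra support" or something else remains the job of the human reviewing the result.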

"How do we operationalize diversity in a selection task? If we had a distance
function between pairs of candidates, we could measure the average distance
between selected candidates. As a strawman, let’s say we use the Euclidean
distance based on the GPA and interview score. If we incorporated such a diversity
criterion into the objective function, it would result in a model where the GPA is
weighted less. This technique has the advantage of being blind: we didn’t explicitly
consider the group membership, but as a side-effect of insisting on diversity of
the other observable attributes, we have also improved demographic diversity.
However, a careless application of such an intervention can easily go wrong: for
example, the model might give weight to attributes that are completely irrelevant
to the task.
More generally, there are many possible algorithmic interventions beyond
picking different thresholds for different groups. In particular, the idea of a
similarity function between pairs of individuals is a powerful one, and we’ll see
other interventions that make use of it. But coming up with a suitable similarity
function in practice isn’t easy: it may not be clear which attributes are relevant,
how to weight them, and how to deal with correlations between attributes." FML
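
The diversity measure proposed in the quotation, the average distance between selected candidates, can be written down in a few lines. This is a minimal sketch assuming two attributes per candidate (GPA and interview score), both already rescaled to the range 0 to 1; the candidate values are hypothetical.

```python
import math
from itertools import combinations


def average_pairwise_distance(candidates):
    """Mean Euclidean distance over all pairs of selected candidates."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections: (GPA, interview score), both rescaled to 0-1
similar_selection = [(0.90, 0.5), (0.92, 0.5), (0.95, 0.5)]
varied_selection = [(0.90, 0.2), (0.60, 0.9), (0.95, 0.5)]

diversity_similar = average_pairwise_distance(similar_selection)
diversity_varied = average_pairwise_distance(varied_selection)
```

A selection procedure that rewards a higher value of this measure would prefer `varied_selection`. As the quotation warns, the result depends entirely on which attributes enter the distance and how they are weighted.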

 
