Academic Baby Steps

Approaching research as an undergraduate student

There's a common misconception about studying computer science at university: that it's supposed to make you a (good) programmer. It usually comes with a sarcastic remark about uni being useless. After all, how much of what you learn will actually be useful for the software developer job you're most likely going to do later on? Clearly, uni wastes time on things that don't matter to most people, so computer science degrees are essentially a failure.

The problem with this sentiment is, of course, that the purpose of uni – especially in a degree as broad as a B.Sc. in computer science – is not to prepare you for a programming job, or for any concrete job at all. Its real purpose is to give you space to figure out which parts of a field interest you the most, and to prepare you for continuing on an academic path towards scientific research, for example by pursuing a Master's degree later on.

This is not the path that everybody wants to take, and that's fair. Like many people, I dislike this weird societal limbo where we pretend that, on the one hand, you need a university degree for many industry jobs, but on the other hand, that degree isn't designed to be actually useful for that job. It's not as bad as people sometimes make it out to be, but it's certainly a bit strange.

That said, I do like the way university currently works, for the most part. I like that I can study things that interest me, even if they don't provide "economic value" for some software company. I'm not exactly sure where life will take me career-wise, but as long as the world doesn't end first, I can certainly see myself working in academia too.

Technically, I am a published author

This is true: there is a public, peer-reviewed paper out there with my name on it, which is a somewhat decent thing to brag about as a Bachelor student. Now, it wasn't published published – it's not in any journal. But the story behind it still taught me a lot and gave me a perspective on academic computer science very early on.

At KIT, all computer science students have to do a practical project at some point; it's a mandatory part of the curriculum. The point of this project is to emulate a "real-world" software project: you work as a team, have a "customer" (the advisor(s)), and you develop the product in stages. Where the "real-world" aspect somewhat falls apart is the organisational model: everything must be meticulously and painstakingly specified up front, from the initial requirements to the actual architecture and design of the software, without writing any code. This is "waterfall"-style design, and while I'm sure it taught us how to draw UML diagrams, I'm not so certain it was the best way to write the software.

We did this work for a specific research group at KIT, to help the researchers improve a certain aspect of their own work. And we put a lot of effort into it. I was lucky to have a great team, and against all odds, we managed to produce something halfway decent. Since it was relevant to our advisors' research, they decided to write a paper about it and submit that to a small workshop at a pretty big academic conference. The paper got accepted, and one person from our team was invited to go to the conference and present our work. Obviously nobody wanted to go alone, so we managed to negotiate a deal where two of us could go and we'd get 2 flights and 3 nights of accommodation paid for. If you ask me, KIT was still being kind of cheap here (surely they'd have had enough money to spare for the other person's flights?), but whatever. I was one of the people to go.

This was my first (and, as of writing this, only) conference ever. We were a clueless pair of Bachelor students attending an event¹ that was way out of our league. But it was incredibly exciting to present something we made to leading researchers in the field, and I'm super grateful for having had the opportunity to dip my toes into research so soon.

My first seminar

In a seminar at university, you get introduced to a field of research and then select one of several topics in that field. Your task is to write a paper on this topic, introducing it and the current research around it. You don't do research yourself – you read other people's works.

Participating in a seminar is another mandatory part of the curriculum for us. It is meant to introduce you to the scientific working method by teaching you how to find and read literature and how to write and format scientific papers yourself. This knowledge is crucial for writing a thesis, and seminars are typically the first time students learn about any of this.

This semester, I participated in a seminar titled "Software Sustainability" and the title of my assigned topic was "Ethics in the Digital Age: Privacy, Data Protection, and other Big Data Challenges". A mouthful, I know. On its own, this is pretty vague, so it took a significant amount of effort to narrow down what I actually wanted to write about. What I ended up with is a paper on ethical issues in connection with "Big Data" and the proposed ways of approaching them. I read a lot of literature – more than is expected of students for this kind of seminar – but the topic coincided with my personal interests, and especially my critical view of AI, so it came naturally.

I will never escape LLM slop

The seminar was special in one way: its procedure resembled that of an actual conference, in that we had to submit a draft and review other submissions. I really liked this, because peer review is one of those things you should probably already be familiar with before you submit research to conferences or workshops for the first time.

Unfortunately, the submission I had to review was... bad. In short, the paper was mostly LLM-generated, and this fact was pretty obvious. It was written in this sort of distinctively non-distinctive ChatGPT style of mediocrity (and lots of bullet points) while not really saying anything of substance.

I ended up producing quite an extensive review, mainly because it was fun for me to chase down the various artifacts that were probably the result of "hallucination" and the like. Most blatantly, the author didn't even bother to double-check the citations that ChatGPT (or whatever they used) spat out, so most of them were complete nonsense, including one reference that did not exist at all. To give you an impression, here's an excerpt from my review:

The references are highly problematic. Citations are lacking severely, and those that exist are questionable, with some sources being cited partially or entirely inaccurately, and another even being seemingly inexistent. The paper is therefore far away from being a "literature study" as it claims in the abstract.

Despite the clear indications that this was low-effort AI slop, I kept the review neutral, objective, and free of snark, simply because the point of peer review in science isn't to attack people (even when they might deserve it). Needless to say, I nonetheless think that submitting an LLM-generated paper and hoping that nobody will find out is embarrassingly stupid at best and a sign of worrying levels of hubris at worst.

I got an award!

But back to the paper I wrote. The seminar finished yesterday, and I received the "best paper" award for my submission! In the context of a small university course, this isn't all that significant of course, but I was still proud that my work was recognised. If you're interested in viewing Big Data through a critical lens of ethics, definitely check out the sources cited in my paper.

Here's a section from my paper on algorithmic bias that I think turned out particularly well:

2.2 Algorithmic Bias

Big Data works on the basis of two presuppositions: knowledge of correlations allows us to build effective models of reality, and these correlations can be found by analysing large amounts of existing data [43]. The potential for bias, i.e. the production of unfair disadvantages, can be identified in both of these components.

First, there is a fundamental criticism of using correlations as the basis for predictive models. The fact that Big Data finds a correlation between two properties does not permit any conclusions regarding causation. Moreover, finding causes for relations in data is not even a goal of Big Data: its goal is “to predict and target, not to provide any sociological accounting for the reasons why people might seem to occupy particular patterns of life” [39, p. 40]. Applying Big Data’s correlation-centered view to science can be particularly problematic. Fields like sociogenomics attempt to use data analysis to correlate genetic traits with social behaviours in humans. A disregard for causality in such contexts can be a catalyst for pseudoscientific tendencies [39, pp. 68f.].
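
To make the correlation-versus-causation point concrete, here is a minimal sketch in Python (all data and variable names are invented for illustration; this is not taken from the paper or its sources): two variables that share a common cause correlate strongly even though neither influences the other, and a purely correlation-based model cannot tell the difference.

```python
import random

random.seed(0)

# Hypothetical confounder: a neighbourhood's average income (arbitrary units).
income = [random.gauss(50, 10) for _ in range(1000)]

# Both observed variables depend on the confounder, not on each other.
ice_cream_sales = [0.8 * x + random.gauss(0, 3) for x in income]
laptop_ownership = [0.5 * x + random.gauss(0, 3) for x in income]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Prints a strong correlation (around 0.8), yet buying laptops clearly
# does not cause ice cream sales: both merely track income.
print(pearson(ice_cream_sales, laptop_ownership))
```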

A more concrete issue with basing models on mere correlations can be found in the concept of proxies, i.e. the substitution of a simpler numerical value for a complex, qualitative property, under the assumption that the two are correlated. For example, a property that is difficult to quantify is human health. So, in order to build predictive models for healthcare systems, data analysts have traditionally substituted the proxy of “health care cost” for “health”. These models are thus designed to predict health care cost, but treated as if they actually predicted health, under the assumption that cost is an effective stand-in for level of illness [44]. Such proxies lead to weaker, more distant and discriminatory correlations [43, pp. 17f.] [39, p. 68]. The proxy of health care cost, in particular, has led to significant racial bias: due to unequal access to health care in general, black patients in the US were assigned the same risk score (and thus, level of covered treatment) as white patients despite being significantly more ill [44].
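
The mechanism behind the health care example can be illustrated with a toy sketch (a hypothetical model with invented numbers, not the actual system studied in [44]): when one group incurs less cost at the same level of illness, ranking patients by predicted cost systematically under-scores that group.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    group: str      # hypothetical groups: "A" has full access to care, "B" does not
    illness: float  # the property we actually care about (scale 0-10)

def cost_proxy(p: Patient) -> float:
    # Invented assumption for illustration: group B incurs only half the
    # cost of group A at the same illness level, because less care is
    # accessible to them in the first place.
    access = 1.0 if p.group == "A" else 0.5
    return p.illness * access * 1000  # "predicted" cost in arbitrary units

# Two equally ill patients...
patients = [Patient("A", 6.0), Patient("B", 6.0)]

# ...receive very different "risk scores" when cost is used as the proxy.
for p in patients:
    print(p.group, "illness:", p.illness, "score:", cost_proxy(p))

# Any cost-based cutoff for extra treatment would therefore systematically
# exclude group B, even though both groups are equally ill.
```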

Data in itself can already include biases. This begins with the selection of data, or what is “configured for ‘capture’” [17]. For example, the data used to train OpenAI’s GPT-3 language model included parts of Common Crawl, a dataset containing many pages of the world wide web [11]. While one might assume that the world wide web is a good approximation of the diversity of thought, people, and culture found in the real world, this is not the case. First, Internet access is not available everywhere and to everyone in the world, leading to an overrepresentation of younger people from developed countries [5]. Furthermore, datasets like these get curated and filtered before being used as training data, which can accentuate disparities between apparent representation and reality even more [5]. This can happen when, for example, an LGBTQ forum is removed from the dataset, but a social network predominantly used by white men is kept, which would lead to the language model learning a bias towards white men. Many such biases are known to exist in large language models [5], even in subtle ways like prejudice based on dialect [32]. As another example, a data selection bias can be found in commercial face recognition software, which performs significantly worse on dark-skinned women compared to light-skinned men [13].

Eliminating statistical representation bias from datasets is not sufficient for eliminating bias from data altogether, however. The context of dataset creation, i.e. the goals of creating a dataset, the values associated with it, and the conditions of the data work involved (see subsection 2.5), is another vector for introducing bias [18]. For example, the popular computer vision dataset ImageNet was created with the epistemological assumption that there is “an underlying and universal organization of the visual world into clearly demarcated concepts” [18]. This is a problem, because the meaning of images cannot be determined from an absolute or objective viewpoint: lived experiences, situation, and context have an impact on how we understand images. This nuance is completely absent from ImageNet [18]. A similar problem comes to light with cultural differences between those who dictate the parameters of Big Data systems and those who are asked to do data work. For example, a US company may ask workers to perform classification tasks such as deciding whether social media posts are hateful or not, but the US-centric understanding of what counts as hateful, which is presumed to be self-evident, may not be shared by a Latin American worker [40].

Finally, there is the potential for a kind of historical bias in Big Data systems. Since classification and prediction rely on data collected and analysed in advance, Big Data’s “modelling is stuck in abstractions drawn from the past, and so becomes a rearrangement of the way things have been rather than a reimagining of the way things could be” [39, p. 43]. This is especially a problem for machine learning models that are computationally expensive to train and hard to change after the fact: the notion of static data conflicts with changing societal views, because it can create “value-lock” [5]. Going even further, when Big Data’s predictions are applied to decisions that impact the future, they can create a kind of feedback loop, where the models “help to create the environment that justifies their assumptions” [43, p. 29]. In other words, if biased, data-driven predictive technologies are used as a basis for decision making, they validate themselves by causing the manifestation of their own predictions.
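
As a rough illustration of this self-validation effect, here is a deliberately stylised simulation (the districts, numbers, and update rule are all invented for this sketch): the model's belief determines where data is collected, and the model is then updated on the data it shaped.

```python
# Two districts with identical true incident rates, but the model starts
# out believing one is riskier than the other.
true_rate = {"district_1": 0.3, "district_2": 0.3}
belief = {"district_1": 0.6, "district_2": 0.4}

for step in range(5):
    # Attention (patrols, audits, screenings) follows the model's belief,
    # and incidents are only *recorded* where attention is directed.
    recorded = {d: true_rate[d] * belief[d] for d in true_rate}
    # Stylised "retraining": beliefs are reinforced in proportion to the
    # counts that were collected under them, then renormalised.
    weighted = {d: belief[d] * recorded[d] for d in recorded}
    total = sum(weighted.values())
    belief = {d: w / total for d, w in weighted.items()}
    print(step, belief)

# Belief in district_1 approaches 1.0 even though both true rates are
# equal: the model manufactures the evidence justifying its own assumption.
```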

If you're interested in reading the rest of it, the PDF is available on this website: Ethics and Big Data

More science

Apart from the workshop project and the software sustainability seminar, there are three more things in my studies that are kind of science-y:

  • Seminar on "EU Digital Regulatory Framework" (I wrote a report on the Digital Services Act)
  • A paper I'm writing for another course, on applying email phishing techniques to social media
  • Bachelor thesis (TBD)

I've already finished the EU law seminar, so if you're interested in my report (i.e., summary) of the Digital Services Act or have any questions about this piece of EU regulation, hit me up!


  1. Actually, we didn't have tickets for the actual conference (which would have cost us like $700 each), just for two days of workshops before that. We still snuck into the real conference to see what it's like and to steal from the buffet, though.


