FAIRmedia

Fair and Trusted Datasets for Media Computing

Background

The media sector is facing unprecedented challenges. Tech companies are entering the media market, social media increasingly satisfy information needs, the speed and volume of information have increased while audience attention spans have decreased, and misinformation is on the rise across many channels; these are just a few examples of the changes that have recently taken place. Artificial intelligence (AI)-powered media production technologies can support the media industry in responding to these challenges and preserving its pivotal role in democracy. However, the (large-scale) training datasets used by AI-powered technologies are mostly not European in origin. These datasets are therefore neither under the control of European organisations nor protected under European data privacy regulations. What is more, media organisations often need individual AI solutions, which requires either a great deal of programming work or the use of cloud-hosted services that are mostly general-purpose and not customised for journalism tasks.

Project Content

FAIRmedia addresses these challenges of the media sector. We intend to develop a toolchain and guidelines (including legal issues) for creating datasets and utilising them in model training. More precisely, users are guided through straightforward processes of selecting, preparing and annotating data, determining possible biases in the datasets, and feeding them into machine learning models. We also plan to provide Explainable AI tools that enable users to assess model quality and check whether a model works as intended. The tools cover two relevant use cases of AI in the media domain: classification and question answering. They are specifically designed for these use cases, which sets them apart from general-purpose tools.
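As a rough illustration of this workflow, the sketch below loads an annotated dataset, inspects the category balance as a simple bias indicator, and trains a baseline text classifier. It is a minimal, hypothetical example rather than the project toolchain: the file name and the "text"/"label" columns are placeholders, and scikit-learn merely stands in for whichever models the toolchain will eventually support.

    # Minimal workflow sketch: curated data -> imbalance check -> baseline model.
    # "annotated_articles.csv" and its "text"/"label" columns are hypothetical.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("annotated_articles.csv")        # curated, annotated dataset
    print(df["label"].value_counts(normalize=True))   # crude check for category imbalance

    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
    )

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("Held-out accuracy:", model.score(X_test, y_test))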

Goal

FAIRmedia seeks to investigate and build AI-based instruments for use in media production, allowing domain specialists such as journalists, archivists, and researchers to create their own datasets and tools and to evaluate them in content classification and question-answering scenarios. In particular, we pursue the following project aims:

  • Develop a toolchain to enable domain experts to curate and annotate textual and audiovisual datasets.
  • Set up guidelines for dealing with intellectual property and privacy issues.
  • Develop research methods that make it possible to adapt classification and question-answering models to different tasks.
  • Develop research methods for bias analysis.
  • Apply research methods that help to understand how the models arrived at their decisions (post-hoc explainability).
  • Build up a network that connects to the growing EU Media Data Space.

Procedure & Use Cases

The first use case revolves around the categorisation of content on the basis of user needs. Categorisation tasks are essential: they not only enable audience-centred content production and scheduling, but also aid in monitoring the output of media organisations. The core issue of this use case is adjusting classification processes to different media types and organisations and dealing with imbalanced categories (which makes assignment to a distinct category challenging); both require the domain knowledge of a journalist.
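To make the imbalance issue concrete, the following sketch shows one common way of quantifying and compensating it: inverse-frequency class weights. The label distribution below is a toy assumption, not project data.

    # Toy sketch of compensating imbalanced categories with inverse-frequency
    # class weights; the label counts are made up for illustration.
    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    labels = np.array(["politics"] * 80 + ["sport"] * 15 + ["culture"] * 5)
    classes = np.unique(labels)

    # Rare categories receive larger weights so a classifier is not
    # dominated by the majority class during training.
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
    print(dict(zip(classes, np.round(weights, 2))))
    # {'culture': 6.67, 'politics': 0.42, 'sport': 2.22}

Such weights can be passed to most classifiers (e.g., via a class_weight parameter in scikit-learn); deciding whether an imbalance reflects reality or a sampling problem, however, still requires journalistic domain knowledge.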

A second use case of high relevance is question answering, i.e., posing factual questions to a corpus of media assets. There has been significant progress in this area, yet some problems remain unresolved. For instance, the content used to train large pretrained models is not fully known, so it is difficult to understand on what basis the models make their decisions. Moreover, such models are usually general-purpose and of limited use for answering questions that pertain to a regional or local context. In this project, we rely on content that is stored in the archives of media organisations and has been prepared in accordance with the highest journalistic standards. In summary, the second use case involves preparing a dataset, training a model on it, and validating the model's performance.
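A minimal sketch of such a setup is shown below, under the assumption of a simple TF-IDF retrieval step followed by an off-the-shelf extractive QA model (deepset/roberta-base-squad2, chosen purely for illustration); the archive snippets are invented and stand in for real archive content.

    # Sketch: retrieve the most relevant archive passage, then extract an answer
    # span from it. Archive snippets and model choice are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from transformers import pipeline

    archive = [
        "The city council approved the new tram line in March 2023.",
        "The regional theatre festival opens with a premiere next week.",
        "Flood protection along the river was extended after the 2002 floods.",
    ]
    question = "When did the city council approve the tram line?"

    # Step 1: TF-IDF retrieval of the best-matching passage.
    vectorizer = TfidfVectorizer().fit(archive + [question])
    scores = cosine_similarity(vectorizer.transform([question]), vectorizer.transform(archive))
    context = archive[scores.argmax()]

    # Step 2: extractive question answering on that passage.
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    print(qa(question=question, context=context))   # e.g. {'answer': 'March 2023', ...}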

Result

The FAIRmedia project supports media organisations in generating datasets for AI training and provides them with guidelines on best practices. It expands competences in learning from limited data, explainable AI, and no-code tools for machine learning (ML). FAIRmedia also strengthens self-creation capabilities by developing tools that enable journalists and researchers to produce custom datasets and employ AI technologies.

Want to know more? Feel free to ask!

Deputy Head of
Media Computing Research Group
Institute of Creative\Media/Technologies
Deputy Academic Director Media Technology (BA)
Lecturer
Department of Media and Digital Technologies
Location: A - Campus-Platz 1
M: +43/676/847 228 630
External project manager
DI Georg Thallinger
Partners
  • JOANNEUM RESEARCH Forschungsgesellschaft mbH (lead)
  • APA-DeFacto Datenbank & Contentmanagement GmbH
  • Österreichischer Rundfunk
  • Technisches Museum Wien mit Österreichischer Mediathek
  • RedLink GmbH
  • Universität Wien Institut für Innovation und Digitalisierung im Recht
Funding
FFG (Digitale Technologien 2022)
Runtime
10/01/2023 – 09/30/2026
Status
current
Involved Institutes, Groups and Centers
Institute of Creative\Media/Technologies
Research Group Media Computing