Masakhane: Use of the JW300 Dataset for Natural Language Processing

The Masakhane Project shows how open, collaborative work can advance natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset, a large multilingual corpus consisting mainly of copyrighted biblical translations, exposed significant legal and ethical challenges centered on copyright restrictions, contract overrides, and the complexities of cross-border data use. As a result, Masakhane discontinued its use of JW300 and shifted toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research.

A video version of the case study is available below.

1. What Is Natural Language Processing?

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language, both written and spoken. NLP integrates computational linguistics with machine learning, deep learning, and statistical modeling, allowing machines to recognize patterns, extract meaning, and respond to natural language inputs in ways that approximate human comprehension.

NLP underpins many everyday technologies, including search engines, digital assistants, chatbots, voice-operated GPS systems, and automated translation services. NLP is crucial for breaking down language barriers and has become integral to the digital transformation of societies worldwide [1].

2. Masakhane Project Overview

The Masakhane Project is an open-source initiative dedicated to advancing NLP for African languages. Its mission is to democratize access to NLP tools by building a continent-wide research community and developing datasets and benchmarks tailored to Africa’s linguistic diversity. By engaging researchers, linguists, and technologists across the continent, Masakhane works to ensure that African languages are not marginalized in the digital age.

The project employs advanced sequence-to-sequence models, training them on parallel corpora to enable machine translation and other NLP tasks between African languages. The distributed network of contributors allows Masakhane to address the unique challenges of Africa’s linguistic landscape, where many languages lack sufficient digital resources.

A notable achievement is the “Decolonise Science” project, which creates multilingual parallel corpora of African research by translating scientific papers from platforms like AfricArxiv into various African languages. This initiative enhances access to academic knowledge and promotes the use of African languages in scientific discourse, exemplifying Masakhane’s commitment to African-centric knowledge production and community benefit.

3. JW300 Dataset and Its Role

The JW300 dataset was pivotal to Masakhane’s early work. It covers more than 300 languages, many of them African, with around 100,000 parallel sentences per language pair on average, mostly sourced from Jehovah’s Witnesses’ biblical translations. For many African languages, JW300 is one of the few large-scale, aligned text sources available, making it invaluable for training baseline translation models such as English-to-Zulu or English-to-Yoruba.

Masakhane used automated scripts to download and preprocess JW300. Preprocessing included byte-pair encoding (BPE), which splits words into smaller subword units so that translation models can handle rare and morphologically complex words. Community contributions further expanded the dataset’s coverage, filling language gaps and improving resource quality. JW300’s widespread use enabled rapid progress in building machine translation models for underrepresented African languages.
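The byte-pair encoding step mentioned above can be illustrated with a minimal sketch. This is not Masakhane’s actual preprocessing script, which the case study does not reproduce; it is a toy version of the general BPE idea (repeatedly merging the most frequent adjacent symbol pair), and the function names `learn_bpe` and `apply_bpe` and the sample sentence are assumptions made for illustration. Real projects would normally rely on an established subword tool rather than code like this.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-tokenised sentences."""
    word_freqs = Counter(word for s in sentences for word in s.split())
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)


# Illustrative toy input, not JW300 data.
merges = learn_bpe(["low lower lowest low low"], num_merges=3)
print(merges)
print(apply_bpe("lowly", merges))
```

The unseen word "lowly" is still segmented into known pieces instead of being mapped to a single unknown token, which is the property that makes subword segmentation useful for languages with little training data.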

4. Copyright Infringement Discovery

Despite JW300’s open availability on platforms like OPUS, its use was legally problematic. In 2023, a legal audit by the Centre for Intellectual Property and Information Technology (CIPIT) in Nairobi revealed that the Jehovah’s Witnesses’ website explicitly prohibited text and data mining in its copyright notice. This meant Masakhane’s use of JW300 was unauthorized.

When Masakhane’s organizers formally requested permission to use the data, their request was denied. This highlighted a fundamental tension between Masakhane’s open research ethos and the proprietary restrictions imposed by the dataset’s owners, forcing the project to reconsider its data strategy.

5. Copyright Exceptions and Limitations: The Role of TDM Exceptions and Fair Use

Many jurisdictions provide copyright exceptions and limitations to balance creators’ rights with the needs of researchers and innovators. The European Union’s text and data mining (TDM) exceptions and the United States’ Fair Use doctrine are prominent examples.

The EU’s Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two mandatory TDM exceptions. The first (Article 3) allows research organizations and cultural heritage institutions to conduct TDM for scientific research on works they have lawful access to, regardless of contractual provisions. The second (Article 4) permits anyone with lawful access to perform TDM for any purpose, provided the rights holder has not expressly opted out. Recent German case law clarified that an explicit reservation in a website’s terms is sufficient to exclude commercial TDM, but the exception remains robust for research contexts.

In the U.S., the Fair Use doctrine allows limited use of copyrighted material without permission for purposes like criticism, comment, teaching, scholarship, or research. Courts increasingly recognize that using copyrighted works to train AI models can qualify as Fair Use, especially when the use is transformative and does not harm the original work’s market.

Had Masakhane operated in the EU or U.S., these exceptions might have provided a legal basis for using JW300 for non-commercial research. However, most African countries lack clear TDM provisions or Fair Use recognition, exposing researchers to greater legal uncertainty and risk. The Masakhane experience underscores the need for African nations to adopt or clarify copyright exceptions that support research and digital innovation.

6. Contract Overrides

Contract overrides occur when contractual terms—such as website terms of service—impose restrictions beyond those set by statutory copyright law. In JW300’s case, Jehovah’s Witnesses’ website terms explicitly prohibit text and data mining, overriding any potential exceptions or fair use provisions.

For Masakhane, this meant that even if their use could be justified under fair use or research exceptions in some jurisdictions, the contractual terms imposed stricter limitations. Only in jurisdictions where statutes prevent contracts from overriding copyright exceptions (such as the EU’s TDM provision for research institutions) could these terms be challenged. This highlights the importance of reviewing all terms of service and data use agreements before using third-party datasets, especially in open, cross-border research projects.

7. Cross-Border Use

The cross-border nature of datasets like JW300 adds further legal complexity, especially for open research projects with contributors across multiple countries. Masakhane operates in a pan-African context, with team members and users in different nations.

Copyright and data use laws vary widely. What is permissible under fair use or research exceptions in one country may be prohibited in another. Contractual restrictions may also be enforceable in some countries but not others, depending on local contract law and international agreements.

For Masakhane, this meant legal risks varied among participants. Contributors in countries with strong contract protections faced greater liability, while those in jurisdictions with robust fair use or research exceptions had more flexibility. This variability complicates project management and underscores the need for harmonized legal frameworks to support cross-border collaboration in NLP and related fields.

8. Impact of Copyright Enforcement

Enforcement of copyright restrictions had immediate, significant consequences for Masakhane. The project was forced to abandon JW300, halting ongoing translation projects and preventing the deployment or sharing of models trained on this data. These setbacks led Masakhane to pivot toward community-driven data collection initiatives, such as Kencorpus.

By focusing on creating original, ethically sourced datasets, Masakhane aims to avoid legal risks and ensure the sustainability of its work. This transition underscores the importance of data provenance and permissive licensing in developing NLP resources for low-resource languages [1].
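One way to make data provenance and licensing checks routine is to record them alongside each dataset and filter before training. The sketch below is purely illustrative and not a description of Masakhane’s tooling: the `DatasetRecord` fields, the `PERMISSIVE_LICENSES` allow-list, and the two example records (with placeholder names and example.org URLs) are all assumptions. Which licenses and terms are acceptable is a legal judgment that code cannot make.

```python
from dataclasses import dataclass

# Hypothetical allow-list of SPDX-style license identifiers.
# A real project would decide this with legal advice.
PERMISSIVE_LICENSES = {"CC0-1.0", "CC-BY-4.0"}


@dataclass(frozen=True)
class DatasetRecord:
    """Provenance metadata kept alongside a dataset (illustrative fields)."""
    name: str
    source_url: str
    license_id: str       # SPDX-style identifier, or "proprietary"
    tdm_prohibited: bool  # True if the source's terms prohibit text and data mining


def is_cleared(record):
    """A dataset is cleared only if its license is allowed and TDM is not prohibited."""
    return record.license_id in PERMISSIVE_LICENSES and not record.tdm_prohibited


def split_by_clearance(records):
    """Separate records into those cleared for training and those needing review."""
    cleared = [r for r in records if is_cleared(r)]
    blocked = [r for r in records if not is_cleared(r)]
    return cleared, blocked


# Placeholder records, not real datasets.
records = [
    DatasetRecord("community-corpus", "https://example.org/community", "CC-BY-4.0", False),
    DatasetRecord("scraped-corpus", "https://example.org/scraped", "proprietary", True),
]
cleared, blocked = split_by_clearance(records)
print("cleared:", [r.name for r in cleared])
print("needs review:", [r.name for r in blocked])
```

Keeping the terms-of-service finding as an explicit field means a dataset that is freely downloadable but whose source prohibits mining, as in the JW300 case, is flagged before any model is trained on it.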

9. Broader Implications for NLP Research

The challenges Masakhane faced reflect broader systemic issues in low-resource language initiatives. Copyright barriers disproportionately affect projects that rely on religious or cultural texts, which are often the only available resources for underrepresented languages. Legal uncertainty around reusing such texts can discourage researchers and stifle innovation.

There is growing advocacy for adopting TDM exceptions in African copyright law and developing ethical frameworks for using culturally significant texts in research. These measures would help balance rights holders’ interests with the public good of advancing linguistic diversity and inclusion.

10. Conclusion

The Masakhane Project’s experience with JW300 highlights both the promise and pitfalls of leveraging existing resources for African NLP. While JW300 enabled rapid progress in machine translation for many African languages, its proprietary nature and associated legal complexities ultimately forced Masakhane to change course. This case underscores the urgent need for balanced intellectual property frameworks that support open research while respecting legal and cultural boundaries. Masakhane’s commitment to ethically sourced, community-generated data sets a valuable precedent for the sustainable development of inclusive language technologies.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.
