What do we mean by Text and Data Mining?
Text and data mining (TDM), first defined as “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different (…) resources, to reveal otherwise hidden meanings”, is the process of deriving information from machine-readable works (text, images and sounds) by copying large quantities of material, extracting the data, and recombining it to identify patterns.
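The three steps named in this definition (copying, extracting, recombining) can be illustrated with a minimal Python sketch. The three-document corpus, the stop-word list and the function names are invented for illustration. Real mining pipelines are far larger and more sophisticated.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus standing in for "large quantities of material".
SOURCES = {
    "doc1": "Aspirin reduces fever. Aspirin may irritate the stomach.",
    "doc2": "Fever is common with influenza. Influenza spreads in winter.",
    "doc3": "Aspirin is sometimes taken during influenza to reduce fever.",
}

# Illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"is", "the", "in", "with", "to", "may", "during",
             "sometimes", "taken", "common"}


def copy_material(sources):
    """Step 1: reproduce the works into a working corpus.
    This is the act of copying that copyright law is concerned with."""
    return dict(sources)


def extract_terms(text):
    """Step 2: extract data, here the set of lower-cased content words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def cooccurrence(corpus):
    """Step 3: recombine the extracted data by counting how many
    documents each pair of terms shares."""
    pairs = Counter()
    for text in corpus.values():
        terms = sorted(extract_terms(text))
        pairs.update(combinations(terms, 2))
    return pairs


corpus = copy_material(SOURCES)
pairs = cooccurrence(corpus)
print(pairs[("aspirin", "fever")], pairs[("fever", "influenza")])
```

The pattern that emerges (which terms tend to appear together) is derived from the extracted data, yet it could not be computed without first making a copy of each source text.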
Today, the far larger amount of data being collected (from search engines, social media, apps, wearable devices, etc.) allows us to analyse large data sets with different and complex structures, combining them to identify all forms of correlation and patterns of behaviour. This is relevant not only for different research fields (statistics, health, education, operations, even fashion), but also for purposes such as machine learning, since artificial intelligence needs to be “trained” by analysing large amounts of information.
From a legal perspective, we should ask whether, and to what extent, taking a large quantity of digital material in order to extract and analyse patterns of data is lawful.
In view of the above, the topic can be approached from two different perspectives: copyright and data protection. This article analyses the first.
Copyright law for Text and Data Mining
TDM engages copyright law in several respects, since the use of databases often triggers not only the rights of database owners but also those of the authors of each individual work or document analysed, who have the right “to authorise or prohibit direct or indirect, temporary or permanent reproduction by any means and in any form, in whole or in part”.
Online translators provide a practical example. Services such as Google Translate have long relied on Statistical Machine Translation (SMT) algorithms, which derive their “languages” from a purely statistical and probabilistic analysis of previous human-made translations, having gathered records of international tribunals, company reports, articles and books in bilingual form that were voluntarily uploaded to the web by individuals, libraries, booksellers, authors and academic departments. Today, the largest freely accessible multilingual corpus of documents is probably to be found among the websites of the European Union bodies, a database that includes official papers available in all 24 official languages of the Union.
However, an SMT system trained on this type of corpus would realistically produce extremely formal language. On the other hand, “training” the algorithm on more colloquial language, such as that used in novels or newspapers, is very likely to result in an infringement of copyright law.
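The statistical idea behind this kind of translator can be sketched in a few lines of Python. The example below is a toy that scores word pairs by how often they appear in aligned sentences, using the Dice coefficient. It is not how Google Translate or any production system works, and the six-sentence English-Italian corpus is invented for illustration.

```python
from collections import Counter

# Hypothetical aligned sentence pairs (English, Italian).
PARALLEL = [
    ("the house", "la casa"),
    ("the green house", "la casa verde"),
    ("the book", "il libro"),
    ("the green book", "il libro verde"),
    ("the door", "la porta"),
    ("the dog", "il cane"),
]


def train(pairs):
    """Count, per aligned sentence pair, how often each source word,
    each target word, and each (source, target) combination occurs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update((s, t) for s in src_words for t in tgt_words)
    return src_count, tgt_count, joint


def best_translation(word, model):
    """Return the target word with the highest Dice score for `word`,
    or None if the word never appeared in the training corpus."""
    src_count, tgt_count, joint = model

    def dice(t):
        return 2 * joint[(word, t)] / (src_count[word] + tgt_count[t])

    candidates = [t for (s, t) in joint if s == word]
    return max(candidates, key=dice) if candidates else None


model = train(PARALLEL)
print(best_translation("house", model), best_translation("green", model))
```

The model knows only what its training sentences contain, which is why the register of the corpus (formal EU papers versus novels or newspapers) shapes the output, and why access to copyrighted colloquial text matters so much.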
How can TDM be operated while respecting the boundaries of intellectual property?
The new European copyright directive (2019/790/EU) tried to answer this question by adding some exceptions to copyright protection and reproduction rights that are explicitly meant to be for TDM purposes.
The 2001 Information Society Directive (2001/29/EC) already provided a general exception for temporary acts of reproduction, that is, acts that are “transient or incidental [and] an integral and essential part of a technological process and whose sole purpose is to enable:
(a) a transmission in a network between third parties by an intermediary, or
(b) a lawful use
of a work or other subject-matter to be made, and which have no independent economic significance”.
For the purposes of the 2001 directive, an example of temporary reproduction could be web streaming on YouTube, where the copyrighted audiovisual material is viewed on screen by the end user and only temporarily saved in the cache of the device (computer, smartphone, etc.): as noted by the CJEU, the on-screen reproduction terminates when the user leaves the web page, while the cached copies are normally replaced by other content after a certain time.
The same principle could theoretically be applied to TDM activities, since during the analysis the operating system stores the data in RAM and any trace of these volatile copies is erased when the machine is turned off. Despite this, it is very likely that the miners, especially those acting for research purposes, will retain the data corpus for later use (e.g. verification, aggregation with new data sets and further analysis), probably deleting it only once their work is completed and published. In this case, it is doubtful whether the exception for temporary reproduction would apply.
Updating EU copyright law: Directive 2019/790
Although article 5(1) of the InfoSoc directive still applies, the spread of AI applications and services run by algorithms made it urgent for the European legislator, when discussing the draft of the new copyright directive, to consider introducing specific exceptions for text and data mining.
A preliminary phase of the discussion resulted in an exception for research activities that (and this is a rather innovative approach compared to the InfoSoc directive) was made mandatory for all Member States, meaning that the adoption of the directive would oblige them to introduce the exact same exception in national law, without adding any condition or “without prejudice to” clause. This exception, now article 3 of the Directive, covers “reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.”
It should be noted that the concept of “lawful access” provided by the EU legislator is now quite broad, covering “access to content based on an open access policy or through contractual arrangements between right holders and research organisations or cultural heritage institutions, such as subscriptions, or through other lawful means” but also “access to content that is freely available online.”
Nonetheless, the effectiveness of article 3 is still limited considering that, according to the definitions of “research organisations” and “cultural heritage institutions” given by articles 2(1) and 2(3), many actors still fall outside the scope of the provision: journalists, start-ups, and any other data miners with commercial purposes.
Article 4: the step further
After the first rounds of negotiations (which prompted an outcry from private companies and lobbies), the EU was still facing the issue of not being able to guarantee enough legal certainty to every actor involved. The problem is that where the legislative framework for developing particular kinds of business (such as companies involved in artificial intelligence, which need access to training data) is lacking, those businesses are forced to move their services outside the EU, to the clear detriment of the European economy and technological development.
Facing this challenge, the European legislator introduced a further exception into the draft of the new directive, which was later approved and adopted on 17 April 2019. This further element can be found in article 4, which provides for an exception related to “reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining”.
Article 4 covers everybody and is both purpose-neutral and technology-neutral when combined with the definition of “text and data mining”, which according to article 2 means: “any automated analytical technique aimed at analysing text and data in digital form [thus also including digitised copies of offline materials] in order to generate information which includes but is not limited to patterns, trends and correlations”.
The application of the provision is nevertheless limited in two respects. First, under this exception text and data miners are only allowed to keep the data “for as long as is necessary for the purposes of text and data mining”, whereas under article 3 research organisations and cultural heritage institutions may also retain it for purposes such as verification of the results. Second, this exception can be contractually excluded, as art. 4 states that it “(…) shall apply on condition that the use of works (…) has not been expressly reserved by their rightholders”. For example, a database owner can issue a licence stating that the database is not available for text and data mining purposes. This limitation will not apply to researchers (as the exception covering them is mandatory) but will inevitably affect any intention to scale the results for commercial purposes.
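For content made publicly available online, article 4 mentions machine-readable means as one appropriate way of expressing such a reservation. The sketch below shows how a commercial miner might check for one before copying anything. It assumes, purely for illustration, that a robots.txt rule counts as a reservation (a contested point, not settled law). The crawler name `example-tdm-bot` and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a database owner who reserves
# the /database/ section against one particular mining crawler.
ROBOTS_TXT = """\
User-agent: example-tdm-bot
Disallow: /database/

User-agent: *
Allow: /
"""


def mining_reserved(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules disallow `user_agent` from
    fetching `url`. Treating this as a rights reservation under
    article 4 is an assumption made for this sketch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(mining_reserved(ROBOTS_TXT, "example-tdm-bot",
                      "https://example.org/database/records.csv"))
```

A research organisation relying on article 3 would not need such a check, since the reservation mechanism belongs to article 4 only.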
Overall, the introduction of articles 3 and 4 in the new Copyright Directive finally suggests an acknowledgment of the importance of TDM and, accordingly, a first attempt at a more tech-friendly legal framework, especially one that eases and fosters the spread of AI applications.
As a technology that can be employed in different fields and for different purposes, with great prospects of scalability at a relatively low cost, text and data mining finds in copyright law one of its few, yet no less formidable, obstacles.
The fact that most AI applications operate only on the extracted data, not on the works themselves, does not seem to defuse the debate, as the source works nonetheless need to be copied before undergoing any analytical process.
In fact, since both articles 3 and 4 of the Digital Single Market Directive define themselves as exceptions to copyright, “it has become apparent that TDM is indeed something that, by default, falls within the scope of copyright protection”.
We still have to wait until 2021 to see what path the EU Member States will take when transposing the Directive into their respective national laws. The greatest risk, as always happens with uncharted new sectors and grey areas, is that drawing a non-physical border between the regulated and the unregulated could end up straining the friction points of actual physical borders, creating significant positions of advantage and disadvantage between EU-based businesses and their non-EU competitors.
 M. Hearst, “Untangling text data mining”, June 1999, available at:
 E. Rosati, “Algorithmic Fashion and Copyright: the Regulation of Text and Data Mining to detect and anticipate future trends”, July 2019, available at https://www.iusinitinere.it/algorithmic-fashion-and-copyright-the-regulation-of-text-and-data-mining-to-detect-and-anticipate-future-trends-22708
 Directive 2001/29/EC, art. 2
 C. Grajales, “The statistics behind Google Translate”, June 2015, available at: https://www.statisticsviews.com/details/feature/8065581/The-statistics-behind-Google-Translate.html
 Directive 2001/29/EC, art. 5(1).
 E. Rosati, “CJEU says that you can keep browsing the internet without (copyright owners’) permission”, June 2014, available at: http://ipkitten.blogspot.com/2014/06/breaking-news-cjeu-says-that-you-can.html
 Directive 2019/790/EU, recital 9: “There can also be instances of text and data mining that do not involve acts of reproduction or where the reproductions made fall under the mandatory exception for temporary acts of reproduction provided for in Article 5(1) of Directive 2001/29/EC, which should continue to apply to text and data mining techniques that do not involve the making of copies beyond the scope of that exception.”
 Directive 2019/790/EU
 Directive 2019/790/EU, art. 3
 Directive 2019/790/EU, art. 2(1) ‘research organisation’ means a university, including its libraries, a research institute or any other entity, the primary goal of which is to conduct scientific research or to carry out educational activities involving also the conduct of scientific research:
(a) on a not-for-profit basis or by reinvesting all the profits in its scientific research; or
(b) pursuant to a public interest mission recognised by a Member State; in such a way that the access to the results generated by such scientific research cannot be enjoyed on a preferential basis by an undertaking that exercises a decisive influence upon such organisation;
 Directive 2019/790/EU, art. 2(3) ‘cultural heritage institution’ means a publicly accessible library or museum, an archive or a film or audio heritage institution;
 L. Koschwitz, “The EU just told data mining startups to take their business elsewhere”, September 2016, available at https://www.euractiv.com/section/digital/opinion/the-eu-just-told-data-mining-startups-to-take-their-business-elsewhere/
 E. Rosati, “Copyright as an Obstacle or an Enabler? A European Perspective on Text and Data Mining and its Role in the Development of AI Creativity”, September 2019, available at: https://www.ssrn.com/abstract=3452376
Born in 1996, she is in her final year of Law at the Alma Mater Studiorum-Università di Bologna. Long interested in the relationship between law and new technologies, and eager to explore the subject further through a period of study abroad, she decided to spend an exchange semester in Australia. There she attended UTS: University of Technology Sydney, where she took courses on subjects such as intellectual property, information technology and entrepreneurial innovation.
She is currently in Estonia, where she works as a Research Trainee at the IT Law Programme of the University of Tartu.
In February 2017 she began working with ELSA Bologna (the European Law Students’ Association), later leading the Academic Activities area as Vice President and eventually serving as President.
She is a Senior Associate Editor of the University of Bologna Law Review, with which she has collaborated since 2016.