Moshe Koppel and Avi Shmidman
Automated text analysis is among the main fields revolutionized by recent advances in artificial intelligence (AI). One of the very early successes of the field involved the digitization of classical Jewish texts and the development of search technology for such digitized texts, pioneered by the Bar-Ilan Responsa Project[1] in the 1960s. The digitized corpus of the Responsa Project has steadily grown over the years, now numbering well over half-a-billion words. Other competing databases have sprouted up as well; significantly, these include Sefaria, the first free alternative to the costly Responsa Project, providing a corpus of substantial proportions on the Internet without copyright restrictions.
If we were to ask how the world of Jewish textual databases has changed in the past few decades, we could probably point only to increases in the number, size, and accessibility of such databases. This raises the question: where are the technological breakthroughs? Is the idea of “Torah Technology” still essentially limited to searchable databases? Have we simply been expanding the size and distribution of the same essential idea conceived back in the 1960s?
Perhaps so. Fortunately, however, with so many texts now fully digitized, and with so many of those now freely available for download and reuse, the foundation is now in place for a new wave of Torah technology. Automated tools for intelligent processing and analysis of these textual corpora are currently being developed in AI labs and will soon explode onto the scene. In what follows, we’ll review some of the innovations that can be anticipated as current natural language processing (NLP) and machine learning methods are applied to large corpora of Jewish literature. We’ll also consider some of their likely repercussions, both positive and negative.
Three Trends in Contemporary Torah Literature
Our key claim is that, for the foreseeable future, the application of digital technology to Torah study will mainly extend three trends in contemporary Torah literature that have become clearly discernible with the growth of Torah study in the United States and Israel. These three trends are as follows:
First, classical texts have been made accessible to broader audiences, including those without strong backgrounds in Torah study. Thus, for example, the Steinsaltz Talmud and its many successors have popularized the study of Gemara by introducing vocalization, punctuation, expanded abbreviations, and more, all accompanied by explanatory notes and commentary. Similar popularizations have been made available for other fundamental texts, such as Rambam’s Mishneh Torah,[2] Midrash Rabbah,[3] and classical commentaries on Tanakh and Talmud.[4]
Second, many classical texts have been published in “scientific” editions, which identify the sources from which the text draws, mark variant versions of the text based on comparisons of manuscripts and parallel sugyot, and suggest corrections to the commonly-used versions. These include, for example, the Chavel Ramban al ha-Torah, the Frankel Mishneh Torah, and various editions of rishonim published by Mossad Harav Kook.
Finally, there has been an outpouring of books that assemble, organize, and summarize extant material on a given topic. The most prominent of these works is the (ongoing) Encyclopedia Talmudit, but this genre also includes numerous book-length works on areas of Halakhah ranging from the laws of Shabbat to the laws of shiluah ha-kein, as well as yalkutim such as those found in the Oz Vehadar Mesivta Shas and the Frankel Mishneh Torah.[5]
While these genres appeal to disparate audiences, they have two things in common. They are not platforms for major conceptual breakthroughs or novel interpretations, and they all will ultimately be better handled by computational techniques than by intensive manual labor. Just as search engines have made obsolete the remarkable concordance of Haim Yehoshua and Binyamin Kosovsky,[6] future technologies will render obsolete many popularizations, scientific editions, and anthologies.
How Does it Work?
Popularization: vocalization, punctuation, abbreviations
We begin with the main challenges in producing accessible texts: vocalization, punctuation, and opening abbreviations. Of course, to a certain extent, these challenges can be met with straightforward lookup tables. We don’t necessarily need artificial intelligence to vocalize קמיפלגי as קָמִיפַּלְגִי, nor to expand פחמש”פ as פחות משוה פרוטה. However, in the majority of cases, words and abbreviations are ambiguous. Should דבר be vocalized as דָּבָר or דֶּבֶר or דַּבֵּר? Similarly, if we encounter the abbreviation אא”א, is this אלא אי אמרת or או אינו אלא? Or perhaps it is איסור אשת איש? Cases like these can only be resolved by taking the surrounding context into account. That is, in order to properly elucidate these words, we must devise an algorithm which relates not to individual words, but rather to sequences.
The current leading approach in machine learning, multi-layer neural networks (known informally as “deep learning”), provides a solution. Deep learning has proved astonishingly effective for a wide variety of tasks, from image recognition to automatic text translation. One subtype of these neural networks – termed “recurrent neural networks” – is specifically geared for challenges involving sequences of widely varying sizes, and thus perfectly suited to natural language, which consists of variable-size sequences of characters and words. One of the primary modes of working with deep learning is to set up a system which predicts the next symbol in a sequence of symbols, given sufficient examples of similar such sequences. Conveniently, many fundamental problems in computational linguistics can be translated into just the kind of sequence prediction problems that deep learning handles well. And, indeed, the three challenges we have mentioned here – vocalization, abbreviation expansion, and punctuation – can each be treated as a sequence prediction problem.
For example, one can think of vocalized texts as sequences of the form letter-vowel-letter-vowel etc. A neural network can be trained on a large corpus of vocalized text presented as such a sequence. Even without having been provided with any explicit information about morphology or syntax, the network picks up and encodes implicit patterns in the data. The neural network learns these patterns in a generalized form, abstracted away from the specific words and sentences which were provided as input. The neural network can then use these implicit generalized patterns to vocalize a completely new text which it has never seen before, by predicting which vowel is most likely to “succeed” any given letter, given the surrounding words and letters.
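To make this concrete, here is a minimal sketch in Python (using the PyTorch library) of vocalization framed as sequence prediction: a character-level bidirectional recurrent network reads the consonantal text and predicts, for each letter, the vowel that should follow it. The vocabulary sizes, dimensions, and training loop are illustrative placeholders, not the architecture of any particular production system.

```python
import torch
import torch.nn as nn

NUM_LETTERS = 30   # Hebrew letters plus space/punctuation (illustrative)
NUM_VOWELS = 16    # nikud symbols plus a "no vowel" class (illustrative)

class Vocalizer(nn.Module):
    """Predicts, for each input letter, the vowel that should follow it."""
    def __init__(self, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_LETTERS, emb_dim)
        # Bidirectional LSTM: each letter's prediction can draw on both the
        # preceding and the following context.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, NUM_VOWELS)

    def forward(self, letter_ids):            # (batch, seq_len)
        x = self.embed(letter_ids)            # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                   # (batch, seq_len, 2*hidden_dim)
        return self.out(h)                    # per-letter vowel scores

# Training sketch: letters and gold vowels come from an existing vocalized
# corpus, encoded as integer ids of equal length.
model = Vocalizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(letter_ids, vowel_ids):        # both (batch, seq_len) tensors
    logits = model(letter_ids)
    loss = loss_fn(logits.reshape(-1, NUM_VOWELS), vowel_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The bidirectional layer lets each prediction draw on both the preceding and the following letters and words, which is precisely the kind of contextual disambiguation described above.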
Such systems require a lot of training data and they can be rather fiddly, so the process of training can often be lengthy and painstaking. Assembling the training data can be difficult. Typically, one would start with some manually produced vocalized text (as a practical matter, this often requires dealing with licensing issues), which might be adequate for training a first, very imperfect automated vocalization tool. This imperfect tool can be used to generate more vocalized text, which can then be corrected manually by an expert and used as additional training data, yielding a better system that produces more and better training data. Bootstrapping in this way, increasingly accurate training data can be generated at an accelerating pace.
Very similar methods can be used to train a system to punctuate a text and open abbreviations. Given training corpora containing punctuation and fully expanded abbreviations, a neural network can be trained to predict the relevant punctuation mark (if any) after any given word within a sentence, and to determine the relevant expansion for any given abbreviation within a text.[7]
Thus, for example, the text shown in the top line of the accompanying figure – a line from the Or Zarua,[8] copied verbatim from sefaria.org – would automatically be rendered as shown in the bottom line, complete with nikud, punctuation, and expanded abbreviations. It should be noted that a level of caution needs to be exercised when expanding abbreviations; while we certainly do want to expand בכ”מ to בכמה מקומות, it would not be desirable to expand הרמב”ם into הרב משה בן מיימון, nor to expand בש”ס into בששה סדרי משנה. The artificial intelligence employed to expand abbreviations must also be leveraged to figure out when not to expand.
וכ”כ הרמב”ם בהקדמה לפי’ המשניות בד”ה וכאשר מת יהושע כו’ וזהו דאי’ בש”ס בכ”מ הלכתא גמירי להו
וְכָךְ כָּתַב הָרַמְבָּ”ם בַּהַקְדָּמָה לְפֵרוּשׁ הַמִּשְׁנָיוֹת, בְּדִבּוּר הַמַּתְחִיל “וְכַאֲשֶׁר מֵת יְהוֹשֻׁעַ” וְכוּ’, וְזֶהוּ דְּאִיתָא בְּשַׁ”ס בְּכַמָּה מְקוֹמוֹת “הִלְכְתָא גְּמִירִי לְהוּ.”
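To illustrate how context can decide both which expansion to choose and when not to expand at all, here is a minimal sketch: candidate expansions come from a lookup table, each candidate (including the unexpanded form itself) is substituted into the sentence, and a scoring function, standing in for a trained language model, selects the most plausible result. The systems cited above train neural networks to make this choice directly; the table entries and the dummy scorer here are placeholders.

```python
ABBREVIATIONS = {
    'אא"א': ['אלא אי אמרת', 'או אינו אלא', 'איסור אשת איש'],
    'בכ"מ': ['בכמה מקומות', 'בכל מקום'],
    'הרמב"ם': [],   # no expansion candidates: always left as is
}

def expand(sentence: str, abbreviation: str, sentence_score) -> str:
    # Keep the unexpanded form as a candidate, so "do not expand" can win
    # whenever it yields the most plausible sentence.
    candidates = [abbreviation] + ABBREVIATIONS.get(abbreviation, [])
    scored = [(sentence_score(sentence.replace(abbreviation, c)), c)
              for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Usage with a dummy scorer that simply prefers longer sentences; a real
# system would plug in a language model trained on rabbinic Hebrew.
print(expand('וזהו דאיתא בש"ס בכ"מ הלכתא גמירי להו', 'בכ"מ',
             lambda s: len(s)))   # -> בכמה מקומות
```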
Scientific editions: error correction, source identification, and authorship analysis
Scholars accustomed to navigating rabbinic texts with relative ease may find tools for enhancing ease of reading to be mostly superfluous. But they will get more use from the kinds of automated tools that will eventually be able to produce scientific editions of any rabbinic work. Specifically, scholars might wish to obtain more accurate versions of classical works, as well as identification of the sources of each line in the text. Let’s see how this could be done.
Consider first the problem of producing accurate texts. There are two different situations in which problems of text accuracy arise. In one, we have multiple textual witnesses (that is, versions of the same text) and we wish to compare them and resolve disagreements among them. In another, we have a single text, but we wish to root out mistakes that might have crept into the text at various stages: scribal, printing, or digitization.
For the case of multiple textual witnesses, the first step is to create digital versions of the extant manuscript evidence. Optical character recognition (OCR) has long been an option for printed texts; however, on manuscripts, traditional OCR programs fail miserably. Fortunately, recent advances in the use of deep learning for image processing have yielded new algorithms for Handwritten Text Recognition (HTR). These algorithms produce astonishingly effective results when applied to medieval Hebrew manuscripts, and they thus provide the path for quick and efficient digitization of the extant textual witnesses.
Next, we wish to create what is known as a “synoptic text” (or, “synopsis”) — something like an Excel sheet in which each version is written along a row and each column consists of parallel versions of the same word/concept. Creating such a synopsis involves a more subtle procedure than one might expect. Obviously, words that are orthographically similar should be aligned together, even if they are not identical; thus, for example, the words שמעשיהם and שמעשיהן would align together. However, even words that do not have any letters in common at all will need to be aligned together if their meaning is near-synonymous. Thus, if one manuscript says כל שאתה רוצה לעשות בעולמך, and another says כל שאתה מבקש לעשות בעולמך, the two phrases should be aligned word-for-word, despite the orthographic distance between מבקש and רוצה.[9] Furthermore, if one text has אמר רבי עקיבא and the other has אמר רבי ישמעאל, we must put עקיבא and ישמעאל in the same column since they are clearly parallels, even though they are neither orthographic variants of the same word nor even synonyms. By contrast, if one text has אמר רבי and the other has אמר להו, then רבי and להו should not be aligned in the same column.
Thus, automating the process of synopsis construction entails, inter alia, determining if any given pair of words are synonymous, regardless of orthographic similarity, and, at the same time, determining if any given non-synonymous pair of words might belong to the same semantic category. In both cases, the words must be aligned together in a single column. As a result of some significant breakthroughs for this purpose, along with improvements in the efficiency of certain string-matching algorithms, it has recently become possible to automatically create synoptic texts from multiple textual witnesses of Hebrew texts.[10]
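The core decision can be sketched as follows: a pair of words is a candidate for the same column if it passes either an orthographic-similarity test or a semantic-similarity test. The embedding function `vec` and the thresholds below are assumptions for the sake of illustration; the system cited in note 10 uses its own alignment algorithm.

```python
import math
from difflib import SequenceMatcher

def orthographic_similarity(w1: str, w2: str) -> float:
    # Proportion of matching characters, 0.0-1.0 (difflib's quick heuristic).
    return SequenceMatcher(None, w1, w2).ratio()

def semantic_similarity(w1: str, w2: str, vec) -> float:
    # Cosine similarity of word embeddings; `vec` maps a word to a vector
    # (assumed to have been trained on a large rabbinic corpus).
    a, b = vec(w1), vec(w2)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def alignable(w1: str, w2: str, vec,
              orth_threshold=0.7, sem_threshold=0.6) -> bool:
    # Align two words in one column if they look alike (שמעשיהם / שמעשיהן)
    # or mean alike (רוצה / מבקש); the thresholds are illustrative only.
    return (orthographic_similarity(w1, w2) >= orth_threshold or
            semantic_similarity(w1, w2, vec) >= sem_threshold)
```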
While automated synopsis construction itself is interesting and important, it is actually most significant as a first step in a more grandiose task: reconstruction of the “ur-text,” the common ancestor, if there is one, of all the textual witnesses available to us. There is a rather elegant method for achieving such a reconstruction. To appreciate the basic idea, consider a fairly obvious starting point. Each column of the synopsis includes the alternative possibilities for a particular slot in the text; we resolve that column by choosing the alternative that appears the most times in the column. Since mistakes are presumably relatively infrequent, this simple trick should give us a fair approximation of the original text. Of course, this initial “majority-wins” approach is relatively naive and often not valid; in most cases, certain manuscripts hold more or less weight than others. For instance, Talmud scholars have noted that the Hamburg Nezikin manuscript (Ms. Hamburg, Cod. Hebr. 165[19]) contains a particularly reliable text, and thus should be assigned significantly more weight than the rest of the textual witnesses; in contrast, the Munich manuscript of the entire Babylonian Talmud (Ms. Munich 95) is particularly prone to errors, and should be assigned less weight than others. Fortunately, an artificial intelligence algorithm called “expectation–maximization” provides a solution, allowing us to automatically calculate the appropriate relative weight of each manuscript in a given non-interdependent set of textual witnesses. After determining these weights, we can then compute and output the full “ur-text.”[11]
It should be noted that this algorithm assumes that the textual witnesses fed to the algorithm do not demonstrate direct dependence upon one another. In the case where the input set contains multiple textual witnesses from one given transmission branch, the algorithm can first be run on each branch individually. Afterward, the output texts from these individual runs – one per transmission branch – can be gathered together and fed to the algorithm in order to produce the overall “ur-text.”
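Here is a minimal sketch of the iterative weighted-majority procedure (spelled out further in note 11): vote per column, estimate each witness's reliability from its agreement with the tentative result, re-vote with those weights, and repeat until convergence. A real system must also handle gaps, multi-word units, and the dependencies just discussed; the toy data below are purely illustrative.

```python
from collections import defaultdict

def reconstruct(synopsis, iterations=10):
    """`synopsis`: one row per witness, one aligned word per column."""
    num_witnesses = len(synopsis)
    num_columns = len(synopsis[0])
    weights = [1.0] * num_witnesses

    def vote():
        text = []
        for col in range(num_columns):
            scores = defaultdict(float)
            for w in range(num_witnesses):
                scores[synopsis[w][col]] += weights[w]
            text.append(max(scores, key=scores.get))
        return text

    ur_text = vote()                       # step 1: plain majority
    for _ in range(iterations):
        # Step 2: a witness's weight is its agreement with the tentative text.
        weights = [
            sum(synopsis[w][c] == ur_text[c] for c in range(num_columns))
            / num_columns
            for w in range(num_witnesses)
        ]
        new_text = vote()                  # step 3: weighted re-vote
        if new_text == ur_text:            # step 4: repeat until convergence
            break
        ur_text = new_text
    return ur_text

# Toy example: three witnesses, the third the least reliable.
witnesses = [["אמר", "רבי", "עקיבא"],
             ["אמר", "רבי", "עקיבא"],
             ["תני", "רבי", "ישמעאל"]]
print(reconstruct(witnesses))   # ['אמר', 'רבי', 'עקיבא']
```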
Consider now the case in which we have only a single version of a text and suspect that there may be errors in the text, which we wish to identify and correct. The deep learning methods we described above for predicting the next item in a sequence can be used to assign probabilities to the next character in a text, based on implicit patterns of morphology, syntax, and lexical choice. Thus, if a letter appears in the text we have, but is assigned very low probability by the model — that is, it is very unexpected in that context — we can mark it as a possible error. The variability of language is such that it is unlikely that we will ever be able to mark all the errors without capturing in our net some surprises actually intended by the author. But it would be sufficient if the method were accurate enough that it could be manually reviewed without too much effort; for example, if we could mark 1000 words in a million-word book as suspicious, and these included almost all of the actual errors, it would obviously be a much simpler matter to manually check the marked words than to check the entire book.[12]
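In code, the flagging step might look like the following sketch, where `word_log_prob` stands in for any trained language model and the threshold is a placeholder to be tuned against the reviewer's tolerance for false alarms.

```python
def flag_suspicious(words, word_log_prob, threshold=-12.0):
    # Returns (position, word) pairs that a human reviewer should examine.
    flagged = []
    for i, word in enumerate(words):
        context = words[max(0, i - 10):i]      # preceding context window
        if word_log_prob(word, context) < threshold:
            flagged.append((i, word))
    return flagged

# Usage with a dummy scorer that "knows" only a tiny lexicon, so the
# misspelled בהקדמא is flagged; a real system would plug in a language model
# trained on the relevant corpus.
lexicon = {"וכך", "כתב", "הרמבם", "בהקדמה", "לפרוש", "המשניות"}
text = "וכך כתב הרמבם בהקדמא לפרוש המשניות".split()
print(flag_suspicious(text, lambda w, ctx: -5.0 if w in lexicon else -20.0))
# -> [(3, 'בהקדמא')]
```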
Another important challenge for scholars that can be automated is the identification of sources used in a text of historical importance. Recently developed techniques make it possible to enter a text of any length and obtain a footnoted version of the text in which every quote (exact or approximate) from an earlier text is identified and marked. Thus, if a rishon quotes a pasuk, a Gemara, a midrash — with or without attribution — the quote will be identified and a footnote inserted linking to the original source. This holds even if the quoted source has been altered in orthography or word order, and even if some of its words have been omitted or interpolated.[13]
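A deliberately crude sketch of the underlying idea (the work cited in note 13 uses far more robust matching): normalize the orthography of both passages and check how much of the candidate source survives, in any order, inside the later text.

```python
def normalize(word: str) -> str:
    # Crude orthographic normalization: strip vav and yod (matres lectionis).
    return word.replace("ו", "").replace("י", "")

def word_set(text: str):
    return {normalize(w) for w in text.split()}

def contains_quote(passage: str, source: str, threshold=0.75) -> bool:
    # Fraction of the (normalized) source words found anywhere in the passage.
    p, s = word_set(passage), word_set(source)
    return len(p & s) / max(len(s), 1) >= threshold

gemara = "הלכתא גמירי להו"
rishon = "וזהו דאיתא בשס בכמה מקומות הלכתא גמירי להו"
print(contains_quote(rishon, gemara))   # True, despite the surrounding text
```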
Yet another area of scholarly interest that can be handled by machine learning techniques is authorship analysis. Thus, for instance, given a set of examples of responsa written by Rashba and Ritva, respectively, a machine learning algorithm could exploit lexical, morphological, or syntactic preferences of each author to produce a set of rules that could be used to determine if a previously unseen responsum was authored by Rashba or by Ritva. (Notably, the most consistent and reliable differences between Rashba and Ritva are not content words or phrases, but rather more subtle items, such as their use of conjunctions and other function words. For instance, among other things, the algorithm notices that words such as את, שמא, and שכן are twice as frequent in Rashba’s responsa as in those penned by Ritva; while on the other hand, words such as כי and הזה are found with much higher frequency in Ritva’s responsa.) Similarly, such an algorithm can pinpoint the subtle stylistic differences between the biblical commentaries of Ramban and Rabbeinu Behaya (the most salient difference between them is the use of the word כאשר – six times more likely to be found in any given paragraph of Ramban’s commentary).
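Sketched in code, the attribution step might look like this: each responsum is reduced to the relative frequencies of a fixed list of function words, and a standard linear classifier is trained on texts of known authorship. The word list, feature set, and classifier choice here are illustrative only; published studies use richer features.

```python
from sklearn.linear_model import LogisticRegression

FUNCTION_WORDS = ["את", "שמא", "שכן", "כי", "הזה", "כאשר", "אשר", "גם"]

def features(text: str):
    # Relative frequency of each function word (naive whitespace tokenization).
    words = text.split()
    n = max(len(words), 1)
    return [words.count(fw) / n for fw in FUNCTION_WORDS]

def train_attributor(texts, labels):
    # `texts`: responsa of known authorship; `labels`: e.g. "Rashba" / "Ritva".
    X = [features(t) for t in texts]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf

def attribute(clf, unseen_text: str) -> str:
    return clf.predict([features(unseen_text)])[0]
```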
This method can be extended to handle a variety of problems: Were two texts written by the same author (author verification)? How can a multi-author text be decomposed to its authorial components (source analysis)? When and where was a given text composed (author profiling)? Which words, or morphological or syntactic structures, are markers of a given period, region, genre, or author?[14]
Topic summarization
We have seen how the preparation of popular and scientific editions of classic works can be automated. Now let’s see how topic summarization can be automated as well.
Clearly, any attempt at summarizing a sugya would begin with collecting all relevant sources; under current circumstances, this means using a search engine. Unfortunately, search tools available for Jewish texts are inadequate in a number of ways. Some of these inadequacies are familiar to anyone who uses standard search engines for Hebrew queries. On the one hand, results are incomplete: one doesn’t obtain alternative spellings or morphological variants (masculine/feminine, singular/plural, tenses, with/without conjunctions or prepositions, etc.). On the other hand, irrelevant results are included: due to the ambiguity of most unvocalized Hebrew consonant strings, one obtains unintended senses. For example, a search for עם in the sense of “nation” would result in a flood of results which feature the word עם as a preposition (“with”). Similarly, these search engines do not provide any way to differentiate a search for ha-par (“the bull”) from hefer (“he annulled”). Thus, decent search results require tools that can use context to disambiguate each occurrence of a word with more than one possible meaning.
But, especially when searching classical texts, there are other ways in which standard search tools yield poor results. Suppose we wish to find the main rabbinic sources regarding, say, the use of soap on Shabbat. Well, we might know that the contemporary Hebrew word for soap is סבון, but not know the rabbinic word (בורית). Without a thesaurus mapping the contemporary word to the rabbinic word, we wouldn’t get much. Recent developments render the automated construction of cross-era thesauri practical. These thesauri will be integrated into the search engines, so that any given search term can be automatically expanded to all of the equivalent terms as used in Rabbinic Hebrew.[15]
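Integrated into a search engine, such a thesaurus amounts to simple query expansion, as in the following sketch; the single hand-entered entry stands in for an automatically constructed cross-era thesaurus of the kind cited in note 15.

```python
CROSS_ERA_THESAURUS = {
    "סבון": ["בורית"],   # soap: modern term mapped to its rabbinic equivalent
}

def expand_query(term: str):
    return [term] + CROSS_ERA_THESAURUS.get(term, [])

print(expand_query("סבון"))   # ['סבון', 'בורית']
```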
But consider another problem. It may be that the crucial sources for our purposes deal with the underlying conceptual issues regarding the use of soap on Shabbat (for example, the definition of the forbidden activity of ממרח), without specifically mentioning soap. How could we induce a search engine to point us to these sources? One novel method is as follows: we first find sources that explicitly mention soap using standard search procedures (call these “search results”), and then find later sources (such as a responsum of R. Ovadiah Yosef) that cite or quote many of these sources (call such a responsum a “hub”). Finally, we examine the paragraphs in which those initial search results are cited by the hubs, collecting the other early sources which are quoted by multiple hubs in the same context. These additional early sources are most likely just as relevant to the initial search, but due to fluctuations in formulation might not have been returned as part of the initial search results. We provide these additional sources to the user as “extended search results.”
Note that such extended search results are known to be important and relevant because they are cited by hubs in relevant contexts, even though they might not mention the search term at all. By returning hubs and extended results, this method is likely to give us all the sources required for an overview of a sugya. In fact, these extended search results can be thought of as “curated” results, in the sense that they have been selected by reliable hubs as relevant.
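The hub-based expansion just described can be sketched as follows, assuming we already have the initial search results and, for each paragraph of the later literature, the set of earlier sources it cites (as produced by an automatic citation identifier); the thresholds and source identifiers are placeholders.

```python
from collections import Counter

def extended_search(initial_results, later_paragraphs,
                    min_hits=2, min_support=2):
    # `later_paragraphs`: (work_id, set_of_earlier_sources_cited) pairs.
    initial = set(initial_results)

    # A "hub paragraph" cites at least `min_hits` of the initial results.
    hub_paragraphs = [cited for _, cited in later_paragraphs
                      if len(cited & initial) >= min_hits]

    # Collect the other early sources cited alongside the initial results,
    # keeping those that recur in several hub paragraphs.
    counts = Counter()
    for cited in hub_paragraphs:
        counts.update(cited - initial)
    return [src for src, c in counts.items() if c >= min_support]

# Toy usage with placeholder identifiers: two hub paragraphs both cite an
# additional early source alongside the sources that mention soap explicitly.
paras = [("resp_A", {"src_soap_1", "src_memareah", "src_other"}),
         ("resp_B", {"src_soap_1", "src_soap_2", "src_memareah"})]
print(extended_search(["src_soap_1", "src_soap_2"], paras,
                      min_hits=1, min_support=2))   # -> ['src_memareah']
```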
As before, it should be clarified that these automatically curated results will still contain a certain percentage of false positives, and their use will require some educated filtering and review. Thus, these results are not quite as helpful as the summaries currently provided by encyclopedias and yalkutim, which are manually curated and organized. Nevertheless, they go a long way in that direction; indeed, some educated users might actually prefer to be provided with a less selectively curated assortment of the most relevant sources than one gets in yalkutim, so that they might determine the key sources for themselves.
What is the Current State of Play?
The tools described above are not all available yet, but the underlying technologies already exist. So let’s try to describe the current state of play a bit more precisely: what is already available, what can be expected in the short-term and the medium-term, and what is unlikely to turn up in the foreseeable future?
Reliable vocalization of Hebrew texts is already available, though it is currently more accurate on modern Hebrew texts than on rabbinic texts that mix Hebrew and Aramaic.[16] Automated abbreviation expansion is available as well.[17] Punctuation is still in the laboratory stage,[18] but will likely be available within a year or so.
Automated synopsis construction is already available.[19] Reconstruction of the stemma (a graph showing which manuscripts draw on which earlier manuscripts) and the ur-text are in the laboratory stage, as is error-correction of single manuscripts using language models.
Tools that identify quotes from earlier sources are currently effective for identifying paraphrases from biblical, Mishnaic, and Talmudic texts.[20] Identification of paraphrases from any prior source is in the laboratory stage. (Biblical and Talmudic texts are currently easier to work with as a result of the availability of manual annotation indicating the morphology and lexeme of each word in the text.)
Tools for solving standard authorship attribution problems (was a text written by Author X or Author Y?) are already in wide use. Tools for authorship verification and forgery detection (are two texts by the same author?) and tools for source criticism (distinguishing stylistic or authorial threads within a text) are available and, though they still need more work to extend their scope and reliability, have already produced interesting results. Thus, for example, author verification has been used to solve well-known questions regarding pseudepigrapha: Was the book of responsa, Torah Lishmah, alleged in its preface to have been written by one Yechezkel Kahli, actually a youthful work of Yosef Hayim of Baghdad (better known as Ben Ish Hai)? Indeed it was.[21] Were the letters signed by early hasidic masters, allegedly found in a trove in St. Petersburg after the Russian Revolution, authentic? They were not.[22] Who wrote the anonymous anti-Rabbinic polemic קול סכל? It turns out that it was written by Leone Modena, the same author who penned the scathing critique of that very same book.[23]
Advanced search tools that overcome common problems associated with orthographic and morphological variants already exist for biblical and Talmudic texts.[24] For instance, on the orthographic plane, one can search ונטמאתם and find ונטמתם (Lev. 11:43); and, on the morphological plane, one can search גמל, and then narrow down the results to those results that deal with camels (e.g., Gen. 24:46: וְגַם גְּמַלֶּיךָ אַשְׁקֶה), or to those results that deal with retribution (e.g., Ps. 137:8: גְּמוּלֵךְ שֶׁגָּמַלְתְּ לָנוּ).
Handwritten Text Recognition tools for automatic digitization of Hebrew manuscripts are currently available as well.[25]
All of the above tasks have either been completed or will be completed in the next few years. But note that all these tools are merely efficient ways of doing what has until now been done manually. The difference is one of scale. Projects that have taken lifetimes or have built on the efforts of teams of skilled editors might be scaled to the entirety of Jewish literature in relatively rapid fashion. Once a viable algorithm is designed and developed, its application to the corpus as a whole is just a matter of raw computing power.
Thus, whereas until now the trend of creating accessible versions – vocalized, punctuated, and with expanded abbreviations – has been applied to a very limited set of texts, automated methods could facilitate the production of accessible versions of virtually any text in the near future. The same is true regarding the scholarly world and the production of scholarly editions of Rabbinic texts. Until now, the painstaking work required to produce critical editions has resulted in a situation where only a small portion of the corpus has been properly edited. Only selected chapters of the Talmud,[26] Midrash,[27] and some other canonical works have ever been published in critical editions. Automated methods for the creation of critical editions of almost any book could be available in several years.
Furthermore, the free online accessibility of these algorithms means that we don’t have to wait for the results to trickle down from the publishers one volume at a time. Any person can take a rabbinic text and run it through a vocalization tool or an abbreviation expander. A word of caution is in order, however. As with so many other artificial-intelligence algorithms, the results are not yet 100% accurate; a certain percentage of the words will be incorrectly vocalized, and a certain portion of the abbreviations will be incorrectly expanded. Fortunately, the online tools referenced above provide fully-featured graphic interfaces to efficiently review the results, allowing users to select from alternate possibilities as relevant. Nevertheless, this means that the tools cannot be used blindly; rather, they require proofreading and review, and they require some a priori knowledge as to what the resulting text should look like. Therefore, at the current stage, the tools are most relevant as an aid to maggidei shiur and school teachers, to help them produce highly readable and maximally effective source-sheets for their pupils. Yet, in due time, these tools can be expected to reach sufficient accuracy for any layman to be able to use them as aids to Torah study.
What, then, can’t current computational methods handle? In short, any task that assumes knowledge about the real world and, in particular, insight into human nature. Once we have assembled accurate texts and have at our disposal all the texts relevant to a particular matter, can machines help us to understand the central underlying concepts, to draw relevant analogies and distinctions, to decide what to do when faced with a halakhic question? This is a much taller order. Even such prosaic tasks as determining the bottom line of a responsum, distinguishing prudential considerations from formal textual arguments, identifying leniency and stringency, isolating the specific circumstances upon which a given ruling depends, require a finer degree of knowledge, common sense, and reading comprehension than machines are currently capable of.
Is this Good for the Jews?
Tools such as those we have described here are more than mere conveniences; they are likely to subtly change the way we approach and relate to classical Jewish texts. Let’s consider and evaluate these potential changes.
Some twenty-five years ago, Haym Soloveitchik published his masterful essay entitled “Rupture and Reconstruction: The Transformation of Contemporary Orthodoxy,” in which he identified the decline of the mimetic tradition of Orthodox Judaism, and the rise of the textual tradition in its stead.[28] Whereas halakhic knowledge was once primarily attained by observation in the home or in the local school, the new generation preferred authoritative knowledge sourced in texts. Yet, as Soloveitchik strongly argued, this did not mean that most people turned directly to the texts, for the overwhelming majority of halakhic sources remained largely inaccessible to the layman. Rather, the populace turned instead to those whom they perceived to be the masters of the texts: the roshei yeshiva and the high-level Torah academies. That is, knowledge was still obtained via personal transmission, but the transmission from parents to children was largely replaced with transmission from Talmudic masters to their students.[29] And this, argued Soloveitchik, also led to the sudden rise of da’at Torah.[30]
The technological advances discussed herein may well cause a subtle shift in a different direction. The mantle of knowledge was previously transferred from the home to the roshei yeshiva; yet now, with gradually improving technology that both curates the most relevant sources on a sugya and renders these texts more accessible to laymen, the need for any kind of human transmission might be subtly diminished. Of course, the diminished need for human transmission has already been facilitated by the mere availability of texts on the Internet, but we focus here on the acceleration of this process as a result of automated tools that, to some extent, themselves play the roles of teachers in selecting and elucidating relevant texts.
On the one hand, this is a blessing: it broadens the circle of those participating in one of the defining activities of Judaism, including those on the geographic or social periphery of Jewish life. But there’s an equally obvious potential downside to this. The traditional process of transmission of Torah from teacher to student and from generation to generation is such that much more than raw text or hard information is transmitted. Subtleties of emphasis and attitude — what topics are central, what is a legitimate question, who is an authority, what is the appropriate degree of deference to such authorities, which values should be emphasized and which honored in the breach, when must exceptions be made, and much more — are transmitted as well. All this could be lost, or at least greatly undervalued, as the transmission process is partially short-circuited by technology; indeed signs of this phenomenon are already evident with the availability of many Jewish texts on the Internet.[31]
Now consider academic scholarship, which has to some extent focused precisely on those aspects of Torah study most prone to automation. It has often been noted, for example, that the compilation of concordances of tannaitic and amoraic literature, on which the Kosovskys spent almost the entirety of the 20th century, can be substantially, if imperfectly, reproduced in minutes. But, perhaps more shockingly, even many aspects of a work of overarching genius such as Tosefta Kifshutah of Prof. Saul Lieberman might soon be reproduced efficiently. Quotes and paraphrases of Tosefta in later sources can be systematically identified, and variant readings gleaned from these sources, as well as from digitized manuscripts, can be compared. This hardly covers all of Lieberman’s work on the Tosefta – his running commentary is unparalleled and worth its weight in gold – but it does cover a good deal of it. If this is true of Lieberman, it is true a hundred-fold for ordinary scholars producing scientific editions of classical texts.
This means that scholars, having been largely freed up from technical tasks that can be handled computationally, will have more time to contemplate the bigger ideas. Is this good for the world? To the extent that the marketplace of ideas will be flush with new supply competing to fill the demand, this is a good thing. With technical drudgery automated, we might see a renaissance of novel ideas coming from the academy.
But let the buyer beware. If some current trends are indicative, when those who have the training and skills to compare textual witnesses and track down citations are freed up to peddle “big thoughts,” the results are liable, indeed likely, to include a flood of papers on fashionable postmodern nonsense. (“Deconstructionism and Dadaism in the Shev Shmaitsa”, anyone?)
As for Torah scholars, prima facie, those most deeply embedded in particular traditions of learning would be least affected by the advances we have described here. Trained scholars have no particular need for easy-to-read vocalized editions of the technical literature. Nevertheless, Torah scholars also presumably stand to benefit from the new technologies regarding critical editions and advanced search capabilities. They will be able to draw upon a wide corpus of newly corrected texts, and they will be able to gather a wider range of texts on any given topic than all but the rarest scholars could previously pull out of memory. This is no doubt a blessing.
But, once again, one might wonder if this blessing is an unalloyed one. Imagine that every student of a given topic were to see the same related sugyot and the same “corrected” text. Would this lead to homogeneity of thought? Would it lead to propagation of errors when the algorithms “corrected” incorrectly? Should we, following Hazon Ish, regard the imperfect collection and versions of texts that shaped traditional thinking as ordained or at least canonical by definition? Might the easy availability of variant readings and obscure texts serve as a distraction and draw students towards sterile pedantry?
With regard to variant readings, much depends on how the new computerized texts are presented and utilized. For example, as we explained above, when it comes to variant readings within the text, computational methods have the ability to both widen and constrict. On the one hand, we would have the ability to present the reader with the widest possible apparatus criticus, collating all extant witnesses and parallel passages, and presenting them in an easy-to-read synoptic format (rather than the dense apparatus shorthand which was traditionally used to conserve printing costs), so that every individual variant immediately jumps out and catches the attention of the reader. On the other hand, we also would have the ability to algorithmically determine the “best” nusah. The former could lead either to healthy creativity or unhealthy cherry-picking of convenient readings; the latter could lead either to enhanced accuracy or to stifling homogeneity.
As long as we are speculating on such matters, let’s do a thought experiment. Grandiose AI projects like IBM’s Watson, which defeated the best human Jeopardy champions, garner headlines and evoke fanciful dreams of a similar automated system (posAIk?) that might answer halakhic questions with greater alacrity than any rabbi. No such system appears imminent, but let us entertain the possibility that one day in the not-so-distant future, text analysis algorithms become sufficiently accurate to respond plausibly to halakhic questions, even to the extent of offering what an expert might regard as competent and reliable halakhic decisions. What would be the social consequences of this?
First of all, this would put accurate pesak at everyone’s fingertips. That’s great. Real poskim might even find such a tool helpful for formulating a decision. Wonderful.
But, such a tool could very well turn out to be corrosive, and for a number of reasons. First, programs must define raw inputs upfront, and these inputs must be limited to those that are somehow measurable. The difficult-to-measure human elements that a competent posek would take into account would likely be ignored by such programs. Second, the study of Halakhah might be reduced from an engaging and immersive experience to a mechanical process with little grip on the soul. Third, just as habitual use of navigation tools like Waze diminishes our navigating skills, habitual use of digital tools for pesak is likely to dry up our halakhic intuitions. In fact, framing Halakhah as nothing but a programmable function that maps situations to outputs like do/don’t is likely to reduce it in our minds from an exalted heritage to one arbitrary function among many theoretically possible ones.
In short, Halakhah is preserved and developed as a human process that synthesizes book knowledge and moral intuition in subtle ways. Technical aids to this process will contribute significantly to accuracy, efficiency, and accessibility. At the same time, we must be cognizant that such tools could ultimately dim or even replace intuition, possibly resulting in alienation. We are probably still far from the point at which the long-term costs of technological aids to Torah study outweigh the manifest immediate benefits, but it behooves us to take into account that we may very quickly build up enough inertia to drive us well past that point in the future.
[1] Full disclosure: both authors are employees of Bar-Ilan University.
[2] E.g., the elucidated editions of the Mishneh Torah published by Mossad Harav Kook and the Steinsaltz Institute.
[3] E.g., the מדרש רבה המבואר series, published by מכון המדרש המבואר in Jerusalem (for Hebrew readers), and the Artscroll edition of Midrash Rabbah (for English readers).
[4] E.g., the elucidated versions of Rashi and Tosafot in the Mesivta edition of the Talmud Bavli.
[5] Regarding the preponderance of books summarizing specific areas of Halakhah, see Haym Soloveitchik, “Rupture and Reconstruction: The Transformation of Contemporary Orthodoxy,” Tradition 28:4 (1994), p. 68, especially the extensive material referenced in footnote 8. The phenomenon has only grown since Soloveitchik penned those poignant words over 25 years ago.
[6] Otzar Lashon ha-Talmud, Jerusalem 5714-5749.
[7] Regarding vocalization and abbreviation expansion, see: Avi Shmidman, Shaltiel Shmidman, Moshe Koppel, and Yoav Goldberg, “Deep Learning for Preprocessing Historical Hebrew Texts: Error Correction, Vocalization and Abbreviation Expansion,” ISCOL 2017 (https://www.dropbox.com/s/ca9fmv0jca9nydr/P1.pdf). Regarding abbreviation expansion see also: Y. HaCohen-Kerner, A. Kass, and A. Peretz, “Haads: A Hebrew-Aramaic Abbreviation Disambiguation System,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1923–1932, 2010 [http://dx.doi.org/10.1002/asi.21367].
[8] From הלכות קריאת שמע, section 17.
[9] This is an actual example from Sanhedrin 38b; while almost all manuscripts have רוצה, the Munich 95 manuscript has מבקש.
[10] For a recent algorithm which allows the automatic creation of Hebrew synopses, taking account of semantic connections between words, see: Oran Brill, Moshe Koppel, and Avi Shmidman, “FAST: Fast and Accurate Synoptic Texts,” Digital Scholarship in the Humanities, 2019 (https://doi.org/10.1093/llc/fqz029).
[11] The key is to not stop after finding the plurality choice in each column. Instead, we use the tentative ur-text resulting from the initial step to infer the approximate quality of each textual witness. Given the estimated accuracy of each such witness (the extent to which it agrees with the tentative ur-text), we can assign each witness a weight that reflects its estimated accuracy. We can then recompute a tentative ur-text using the majority method but taking into account the weight of each textual witness. We can now update the weight of each witness in accordance with the current tentative ur-text, recompute the ur-text, update the weights of witnesses again, and so on. The process can be continued this way until convergence (that is, until nothing changes anymore). This method needs to be refined to take into account dependencies among manuscripts (say, if one was copied from the other) and between consecutive words (for example, you can’t take the first word of a two-word phrase without also taking the second word), but empirical tests strongly indicate that this method is extremely accurate. See M. Koppel and M. Michaely, Reconstructing Ancient Literary Texts from Noisy Manuscripts, NAACL Workshop on Computational Linguistics for Literature, San Diego CA, 2016.
[12] For an initial proposal regarding this method, see: Shmidman et al., “Historical Hebrew Texts” (above, n. 7). Other machine-learning based methods have been proposed for the same task; see for instance: Kissos, I., and Dershowitz, N., “OCR Error Correction Using Character Correction and Feature-Based Word Classification” in: Proceedings – 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 (pp. 198–203). IEEE. https://doi.org/10.1109/DAS.2016.44.
[13] See: Avi Shmidman, Moshe Koppel, and Ely Porat, “Identification of Parallel Passages Across a Large Hebrew/Aramaic Corpus,” Journal of Data Mining and Digital Humanities, Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages, March 2018; Michal Bar-Asher Siegal and Avi Shmidman, “Reconstruction of the Mekhilta Deuteronomy Using Philological and Computational Tools,” Journal of Ancient Judaism 9 (2018), 2-25.
[14] For all of the authorship-related items discussed in this paragraph, see: M. Koppel, D. Mughaz, and N. Akiva, “New Methods for Attribution of Rabbinic Literature,” Hebrew Linguistics: A Journal for Hebrew Descriptive, Computational, and Applied Linguistics, 2006.
[15] See: Chaya Liebeskind, Ido Dagan, and Jonathan Schler, “Semi-automatic Construction of Cross-Period Thesaurus,” J. Comput. Cult. Herit. 9, 4, Article 22 (2016), DOI: http://dx.doi.org/10.1145/2994151.
[16] Dicta, a Jerusalem-based research group of which both the undersigned are members, has developed such tools; the vocalization tool, for example, is available at no charge at https://nakdanpro.dicta.org.il/ and is already being widely used.
[17] Available free at: https://abbreviation.dicta.org.il/.
[18] Automated punctuation is being developed in a number of labs around the world, and tools for punctuation of rabbinic texts are being developed at Dicta.
[19] Both Hachi Garsinan and Dicta have made such tools available. Dicta’s automatic synopsis creation tool is available free at: http://synoptic.dicta.org.il/. Hachi Garsinan’s synopsis tool is available free at: https://fjms.genizah.org/.
[20] Available free at: https://citation.dicta.org.il/.
[21] Moshe Koppel, Jonathan Schler, and Elisheva Bonchek-Dokow, “Measuring Differentiability: Unmasking Pseudonymous Authors,” J. Mach. Learn. Res. 8 (2007): 1261-1276.
[22] Moshe Koppel, “Zihui Mehabrim be-Shitot Memuhshavot: Genizat Herson,” Yeshurun 23 (2010): 559-566.
[23] Avi Shmidman, Moshe Koppel, and David Malkiel, “Leone Modena and Kol Sakhal: A New Approach” (forthcoming).
[24] Available free at: http://search.dicta.org.il (Bible search); http://talmudsearch.dicta.org.il/ (Mishnah and Talmud search).
[25] One leading HTR tool which has been shown to be effective with Hebrew manuscripts is Kraken, available free at: https://github.com/mittagessen/kraken.
[26] See, for instance, M. Sabato’s recent edition of the third chapter of Tractate Sanhedrin (Jerusalem 2018), as well as the various volumes of S. Friedman’s Talmud ha-Iggud series.
[27] See, for instance, M. Kahana’s five volume edition of Sifri Bamidbar (Jerusalem, 2011). However, some of the legal midrashim, and many of the aggadic midrashim, still remain unedited.
[28] Soloveitchik (above, footnote 5), pp. 64-130.
[29] Ibid., 94.
[30] Ibid., 95.
[31] A further potential drawback is suggested in a recent article by Gedalya Berger: he suggests that this preponderance of texts may lead to laxity of observance, because a non-expert will be able to locate any and all lenient positions within the corpus, even if only in one esoteric source. Further, Berger notes, increased accessibility and readability of halakhic texts will provide non-experts with the self-confidence to act upon the lenient positions recorded in these sources, even if the text is not one that holds particular authoritative weight in the grand scheme of pesak halakhah. See: Gedalyah Berger, “Some Ironic Consequences of Text Culture,” Tradition 51, 4 (2019), 14-15.