Throughout the millennia, humans have used graphical symbols. Some 5,000 years ago, in Mesopotamia, a system of symbols was codified into what was to become the world's first writing system. But in addition to writing systems, many symbol systems have been developed that, though they do communicate information, do not encode language. Such non-linguistic symbol systems include familiar examples such as mathematical symbology, European heraldry, or scouting merit badges, as well as less familiar ones such as Mesopotamian deity symbols or Dakota winter counts. Like writing, such systems are an important part of the cultures that created them.

Now suppose one has a symbol system whose interpretation is unknown. Short of deciphering it as a writing system, or otherwise providing a rigorous testable interpretation, are there any methods that can determine whether one is dealing with writing or a non-linguistic system? Some recent high-profile papers have indeed claimed to provide statistical evidence that a couple of ancient systems were linguistic. But one problem with that work is that, in order to determine which of two categories an unknown system belongs to, it is very useful to have a large set of examples of the two categories in question. In the case of writing systems, we now have many electronic corpora from hundreds of languages, both ancient and modern. But electronic corpora of non-linguistic systems are few.

This project will fill this void by developing and releasing to the public electronic corpora of a range of non-linguistic systems, including those named above as well as several others. It will also investigate statistical and machine learning methods that might help in distinguishing written language from other graphical communication. And this in turn will lead to a better understanding of a fundamental question about humanity: what sets language apart from other forms of communication?

Project Report

Suppose you are an archaeologist, and you unearth a clay tablet that is inscribed with symbols. Clearly these symbols must have "meant" something to the people who created them, but what exactly was their function? Was it to record language (that is, speech), as with the Sumerian cuneiform script, Ancient Chinese oracle bones, Mayan inscriptions, or the English text you are reading now? Or was it to record some other kind of information not tied to natural language, as with the deity symbols that were used in Babylonian property documents, which listed favored deities of the owner, or Ikea assembly instructions, which are intended to be "readable" by speakers of any language? This issue is a real one: there are many old or ancient symbol systems whose function is largely or completely unknown to us. Examples include the Easter Island rongorongo inscriptions (19th century), the Pictish symbols of Scotland (6th century onwards), or the Indus Valley symbols (Northern India, Pakistan, 3rd millennium BCE). Short of providing a verifiable decipherment into one or more languages, thereby demonstrating that the symbols were written language, how can one know what kind of systems these were?
Over the last few years, some papers have appeared in high-profile science publications arguing that statistics of symbol combinations can provide clues to the answer. One paper, by Rajesh Rao (University of Washington) and colleagues at the Tata Institute and in Chennai, India, appeared in 2009 in the journal Science. It argued that a particular statistical measure, bigram conditional entropy, showed that the Indus Valley symbols behave more like linguistic texts than non-linguistic systems. In another paper, Rob Lee and colleagues (University of Exeter) presented a more sophisticated set of measures that purported to show that Pictish symbols represented a language and were thus a form of writing. This paper appeared in the Proceedings of the Royal Society.
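To make the idea of bigram conditional entropy concrete, here is a minimal sketch in Python. It is an illustration only, not the procedure used in any of the papers mentioned: it assumes each text is a sequence of symbol labels, uses raw (unsmoothed) bigram counts, and the function name bigram_conditional_entropy is hypothetical. Published analyses may use smoothing or other estimators, so values from this sketch are not directly comparable to theirs.

```python
import math
from collections import Counter


def bigram_conditional_entropy(texts):
    """Estimate H(next symbol | previous symbol) in bits.

    `texts` is an iterable of symbol sequences (lists or strings).
    Assumption: plain maximum-likelihood estimates from raw bigram
    counts, with no smoothing.
    """
    bigrams = Counter()
    for text in texts:
        # Count adjacent symbol pairs within each text separately,
        # so no bigram spans a text boundary.
        bigrams.update(zip(text, text[1:]))

    total = sum(bigrams.values())
    if total == 0:
        return 0.0

    # How often each symbol occurs as the first member of a bigram.
    first_counts = Counter()
    for (prev, _), n in bigrams.items():
        first_counts[prev] += n

    # H(Y|X) = -sum over (x, y) of p(x, y) * log2 p(y | x)
    entropy = 0.0
    for (prev, _), n in bigrams.items():
        entropy -= (n / total) * math.log2(n / first_counts[prev])
    return entropy
```

A fully predictable sequence such as "abababab" gives a value of zero, while a corpus in which each symbol is followed by many different symbols gives a higher value. As the report stresses, such a number by itself says little about whether a system is linguistic.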
Both of these papers (and other subsequent papers by Rao and his colleagues) received a large amount of attention from the press at the time. Though this was not necessarily the intention of the authors, in the press accounts these techniques were often presented as demonstrating that the symbol systems in question were written language. The present project provides evidence that the methods proposed in these earlier papers do not work. As part of this work, Sproat developed corpora of a variety of non-linguistic systems, both ancient and modern. These include Mesopotamian deity symbols (Babylonia), totem poles (Pacific Northwest), Pennsylvania barn stars ("hex signs", Pennsylvania), weather forecast icon sequences from www.wunderground.com, and Unicode character sequences in Asian emoticons. He compared them to corpora of fourteen languages, both ancient and modern, covering a variety of different writing-system types. From the point of view of the measures that had been proposed in the previous literature, all of the non-linguistic symbol systems in Sproat's collection behaved the same as the linguistic systems. However, Sproat also found that one of Lee and colleagues' measures, "Cr", used with different settings from the ones they published, together with a novel measure of the amount of local repetition, discriminates quite accurately between the two types of systems. The new setting of Cr was arrived at by training a regression tree on the new data. Note also that, unlike Lee et al.'s setting for Cr, Sproat's version classifies Pictish symbols as non-linguistic, contradicting Lee et al.'s earlier work. These methods also classify Indus Valley symbols as non-linguistic. On the other hand, both of these measures turn out to be correlated with text length, and on balance non-linguistic systems tend to have shorter "texts" than written language.
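The report does not define the local repetition measure, so the following Python sketch shows only one plausible way such a statistic could be computed. It is an assumption for illustration, not Sproat's published definition: among all symbol occurrences that repeat a symbol seen earlier in the same text, it reports the fraction that immediately follow an identical symbol. The function name local_repetition_rate is hypothetical.

```python
def local_repetition_rate(texts):
    """Fraction of within-text repetitions that are adjacent repeats.

    `texts` is an iterable of symbol sequences. Assumption: a
    "repetition" is any occurrence of a symbol already seen earlier in
    the same text, and it is "local" when the immediately preceding
    symbol is identical. Returns 0.0 when no symbol repeats at all.
    """
    adjacent = 0  # repeats that directly follow the same symbol
    total = 0     # all repeats of a symbol seen earlier in the text
    for text in texts:
        seen = set()
        prev = None
        for sym in text:
            if sym in seen:
                total += 1
                if sym == prev:
                    adjacent += 1
            seen.add(sym)
            prev = sym
    return adjacent / total if total else 0.0
```

Because very short texts leave little room for any repetition, a statistic of this kind depends on text length, which is the same caveat the report raises about the measures it evaluated.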
Despite these promising results, Sproat advises caution about relying on statistical measures in guiding one's analysis of unknown ancient symbol systems. All statistical measures are heavily influenced by, among other things, the size of the corpus, the length of texts, and what kind of text is involved: shopping lists have a very different statistical distribution than running prose. Sproat argues that the only really reliable demonstration that something is written language is a verifiable decipherment. What is clear, however, is that the previously proposed methods simply do not work for the intended purpose.

Project Start:
Project End:
Budget Start: 2011-04-01
Budget End: 2013-09-30
Support Year:
Fiscal Year: 2010
Total Cost: $130,140
Indirect Cost:
Name: Oregon Health and Science University
Department:
Type:
DUNS #:
City: Portland
State: OR
Country: United States
Zip Code: 97239