Why was Dr. John Snow designated the "Father of Epidemiology"? His painstaking investigations of the deadly cholera outbreaks in London in the 1850s led him to conclude that the disease was caused by contaminated water. His meticulous data gathering traced the source to a single water pump.
Not only had no one ever mapped the incidence of death before, but even the "germ theory" of disease had not yet been accepted. It took almost twenty years for the scientific and medical professions to accept his premise, but once the water pump was disabled, the cholera epidemic ceased. (See the map at the end of the piece.)
Why did a commercial NLP company, John Snow Labs, choose this name? Though not exclusively producing models for healthcare and life sciences, that is a significant part of their business. I've had the chance to speak with them on several occasions, and they are a remarkable organization, in many ways, which I'll explain. But first, let's review how at least some aspects of NLP work.
By now, everyone is familiar with conversational NLP like Siri. For augmented analytics, the conversation may be, "Download the latest pricing analysis to my phone." The critical thing to remember is that the computer does not understand what you are saying, nor does it understand what it is saying. It can process it and answer, but make no mistake; it's all done with math.
Organizations that offer NLP capabilities do not start from scratch. They can wrap their own software around open-source libraries such as Spark NLP from John Snow Labs, or other open-source Python libraries such as spaCy, textacy, or NLTK. Just to be clear, an NLP system does not answer your question with one model that is told to "Parse my sentence." Each step is a different model. I'm oversimplifying here, but to give you a sense of how part of this works, here are the steps to "understand" a sentence:
- Sentence segmentation: break the text apart into individual sentences.
- Word tokenization: break each sentence into tokens, which are single words or groups of words.
- Predict the part of speech for each token. Feed the token with some surrounding tokens for context into a trained part of speech classifier
- Text Lemmatization: know the base form of every word and its inflections; finding the most basic form of every word, i.e., plural words become singular, conjugated verbs become base verbs, and so on.
- Identify "stop" words (such as a, an, the, …) and filter them out because they don't mean much in the sentence.
- Dependency parsing: extract a dependency parse of the sentence that represents its grammatical structure and defines the relationships between "head" words and the words that modify them.
- Find noun phrases: groups of words that talk about the same things.
- NER (Named Entity Recognition): detect nouns and label them with the real-world concepts they represent: names of people, companies, geolocations, dates and times, amounts of money, names of events, etc.
- Coreference resolution: find all expressions that refer to the same entity in a text, and so attach meaning to pronouns such as "it."
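To make the first few steps above a little more concrete, here is a minimal Python sketch of sentence segmentation, word tokenization, a crude lemmatization, and stop-word filtering. It is a toy that uses only the standard library and hand-written rules. It is not how Spark NLP, spaCy, or NLTK implement these steps (they use trained models), and every function name and the tiny stop-word list are my own assumptions for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption; real libraries ship far larger ones).
STOP_WORDS = {"a", "an", "the", "to", "my", "of", "is", "it"}


def split_sentences(text):
    """Sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Word tokenization: each run of letters, digits or apostrophes is one token."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def naive_lemma(token):
    """Crude stand-in for lemmatization: lowercase and strip a plural 's'."""
    word = token.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def run_pipeline(text):
    """Run each step in order, one sentence at a time."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        lemmas = [naive_lemma(t) for t in tokens]
        # Stop-word filtering: drop words that carry little meaning on their own.
        content = [lemma for lemma in lemmas if lemma not in STOP_WORDS]
        results.append(
            {"sentence": sentence, "tokens": tokens, "content_lemmas": content}
        )
    return results


if __name__ == "__main__":
    text = "Download the latest pricing analysis to my phone. The pumps were disabled."
    for step in run_pipeline(text):
        print(step)
```

The point of the sketch is the shape of the pipeline: each step takes the output of the one before it. In a real library, each of these functions would be replaced by a trained model, and the later steps (part-of-speech tagging, dependency parsing, NER, coreference resolution) cannot reasonably be faked with a few rules like these.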
Consider that John Snow Labs offers a free community version of its Spark NLP that supports an astounding 375 languages, some of which have fewer than 10,000 native speakers. The first question is how, and the second question is why. The how is pretty complicated, and I'll save that for another article; in short, it involves training with deep learning techniques. The why is pretty compelling.
John Snow Labs is a commercial company focused on Life Sciences, Genomics, and Healthcare. Unlike IBM, which proclaimed ten years ago that Watson would cure cancer (and failed), John Snow Labs set out to use NLP technology to assist practitioners in assembling credible medical records that are, to this day, scattered, siloed, and inconsistent.
Particularly with oncology, this is crucial because cancer treatment is still very complicated, and practitioners need all the data they can get. When data is cloistered in multiple EMRs, John Snow Labs frees it. But why 375 languages? As David Talby, the company's founder and CTO, said to me recently, accuracy in B2C transactions is useful, but in oncology it's not a matter of statistics. Every single person is important, whether they're at Mount Sinai Hospital or a Doctors Without Borders camp.
You may wonder, if these models aren't "smart" in any human intelligence fashion, how can you trust them? After all, human language is very complex, and often ambiguous if not nonsensical. The answer is that a few years ago, the accuracy of NLP models hovered around 50%. Today, Spark NLP achieves better than 95% accuracy in academic peer-reviewed results.
We have lots of problems with "AI" companies, especially those with venture funding, expected to exhibit the growth their investors demand. As a result, ethical considerations about the products they produce take a severe hit. John Snow Labs is not in that category:
- They are making significant open-source contributions with a permissive license, democratizing state-of-the-art NLP for a global community.
- They are adding support for many languages around the world, well beyond their revenue-generating markets, supporting a more diverse AI community, a crucial element missing in most AI companies.
- Privacy: their community, users, and customers do not share any data with them, and they don't believe they should. Their software is designed from the ground up to respect security, privacy, and residency laws worldwide.
- They have never raised capital - and hence never made promises driven by incentives to achieve very high margins, profits, or growth rates. This is a strategic point - a lot of "bad" behavior is caused by companies having to hit financially aggressive goals to survive, which sooner or later forces them to make their customers and community pay more and to shirk their own ethical responsibilities in this highly sensitive and impactful technology.
Why is it so crucial for John Snow Labs to have these policies and enforce them? The AI industry is riddled with ethical problems. Many companies engage in a sinister practice, "ethics washing," fabricating or exaggerating their commitment to equitable AI. It's inauthentic and distracts from whether or not actual steps are being taken toward building a world where professional standards demand AI that works just as well for women, people of color, or young people as it does for the white men who make up the majority of people making AI systems.
Training in ethics has not been very effective, at least partly because it's been aimed at AI developers and researchers who make important determinations that can harm people. In contrast, they need to know when the technology benefits and harms. It is clear that better testing and engineering practices, grounded in concern for AI's implications, are urgently needed.
However, focusing on engineers without accounting for the broader political economy within which AI is produced and deployed runs the risk of placing responsibility on individual actors within a much larger system, erasing very real power asymmetries. Those at the top of corporate hierarchies have much more power to set direction and shape ethical decision-making than individual researchers and developers. Racism and misogyny are treated as "invisible" symptoms latent in individuals, not as structural problems that manifest in material inequities. These formulations ignore that engineers are often not at the center of the decisions that lead to harm and may not even know about them. For example, some engineers working on Google's Project Maven weren't aware that they were building a military drone surveillance system. Indeed, such obscurity is often by design, with sensitive projects being split into sections, making it impossible for any one developer or team to understand the ultimate shape of what they are building and where it might be applied.
In January 2021, John Snow Labs released NLU 1.1, which integrates 720+ new models from the latest Spark NLP 2.7 release. These include state-of-the-art results with Sequence2Sequence transformers on problems like text summarization, question answering, and translation between 192+ languages, as well as named entity extraction in right-to-left languages like Arabic, Persian, Urdu, and Hebrew and in languages that require segmentation like Korean, Japanese, and Chinese, and many more, all in 1 line of code. These new features are possible because of the integration of Google's T5 models and Microsoft's Marian models.
NLU 1.1 has over 1,000 pretrained models. In addition, NLU 1.1 comes with nine new notebooks showcasing training classifiers for various review and sentiment datasets and seven notebooks for the new features and models. You can browse the complete list of models in this release.
I'll sum it up this way. Facebook is the world's largest deliberate purveyor of disinformation. A company with, in my estimation, no soul. John Snow Labs is a small commercial NLP company of roughly 75 employees that provides an open source library with hundreds of pre-trained models, including tools, in contrast to Facebook, for detecting disinformation.
John Snow's original cholera data points map.