Search as a Service landscape mid 2017
Search as a Service is a type of SaaS for search. Externally provided search services give you a full-featured search engine within your own software offerings, so you can focus on your data and search UI rather than creating and maintaining all of the nuances of your own search engine.
In addition to discovering the tools on the market and looking at the value they offer, I'll also be doing the usual product comparison, with some architecture-centric concerns addressed as well.
This is a non-exhaustive list of search solutions I've become familiar with that offer value in 2017:
- IndexDepot (built on Solr)
- Websolr (built on Solr)
- Elastic.co Elasticsearch
- Elastic.co Cloud (formerly Found)
- Azure Search
- AWS Elasticsearch (built on Elastic.co Elasticsearch)
- AWS CloudSearch
- Sphinx Search
With so much choice of software, I'm going to focus only on those that are managed services, so a refined list of SaaS providers of search looks like this:
- Elastic.co Cloud (formerly Found)
- Azure Search
- AWS Elasticsearch
- AWS CloudSearch
With this list of SaaS search offerings, I'll first highlight features and do some comparisons, but keep reading to get the analysis!
The first consideration with any SaaS is the languages it supports. An SDK will be your primary way to interact with your data, so having one in a language we (or our team) know well could be a huge consideration in terms of FTE resources and how fast we can build features.
❌ No SDK
🦄 Community SDK Available
As I understand it, IndexDepot uses plugins for services such as Heroku; pay to learn more!
Web services come in many forms, and the buzzword REST gets thrown around a lot (almost none of them are actually RESTful APIs), but I won't get into that and will just demonstrate the advertised function of each product. Most web services just enable you to search over your data; some allow you to project results, some offer endpoints that will ingest your data, and some even offer management of your service.
|Web Service||RESTful||Secure||Ingestion||Management||JSON response||Protocol Buffer (gRPC)|
- Web Service simply refers to there being some sort of HTTP endpoint available at all
- Secure refers to protecting your request data: sending it over HTTPS (encryption) and keeping it out of the URL, which cannot be protected inherently
- Confused by Proto3 / Protocol Buffers / gRPC? Basically REST is dead now and gRPC is the shiny new future. Read about it here
- Data ingestion is also interesting; there are several ways to get data into our database, so it deserves a table:
Getting data into your database for searching.
It's important to mention that inter-service ingestion refers both to migration from other data stores like MongoDB or S3 and to importing from other services such as Dropbox!
Crawlers are the same as web spiders, so you can imagine some of these crawlers would support their own SEO-esque tagging techniques and might even be suitable for crawling your structured data in forms like JSON or HTML tables from generated reports.
Now we are getting into some of the lower level concerns.
In terms of scalability, I will be looking specifically at the service architecture design. To be clear, I don't want to misrepresent how we define scalability in terms of our own services.
Distributed software cannot guarantee 100% consistency. This is best described by the CAP theorem, which states that a distributed system cannot be Consistent, Available, and Partition tolerant simultaneously. Partition tolerance means the system keeps working when a network failure splits its nodes into groups that cannot reach each other. We expect our distributed systems to be FT/HA (fault tolerant and highly available), so you'll find vendors offering 99.9% up-time rather than 100% to accommodate the CAP theorem case.
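To put an up-time figure like 99.9% into perspective, here is a small arithmetic sketch that converts an availability percentage into the downtime it still permits per year. The function name and the 365-day year are my own assumptions for illustration, not terms from any vendor's SLA.

```python
# Convert an advertised availability percentage into allowed downtime.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting its target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% up-time allows {allowed_downtime_hours(target):.2f} hours of downtime per year")
```

At 99.9% that works out to roughly 8.76 hours a year, which is worth keeping in mind when comparing SLAs.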
The recurring theme when looking at how a service scales is the master/slave model employed by services like the ones based on Solr (via ZooKeeper) versus the shards/replicas we see in more modern solutions like Elasticsearch.
There are drawbacks with both. To be effective in horizontally scaling you want to look at sharding (a shared-nothing architecture), which inherently distributes data across many machines. With this comes a common issue called the split-brain problem, which we see in Elasticsearch's use of shards for writing and replicas for reading. Elastic.co Cloud makes certain efforts to protect its users from this issue, and Google's search openly states it has avoided this scenario altogether.
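As a rough illustration of how sharding spreads documents across machines, here is a minimal sketch that routes each document to a shard by hashing its id and taking the result modulo the shard count. Elasticsearch documents a routing formula of this general shape, but everything named here (`shard_for`, `distribute`, the shard count of 5, and `crc32` as the hash) is an assumption chosen for the example, not any vendor's implementation.

```python
import zlib
from collections import defaultdict

NUM_PRIMARY_SHARDS = 5  # assumption: chosen up front for the example


def shard_for(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard from a stable hash of the document id."""
    # crc32 is a stand-in; a real engine would use its own hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards: int = NUM_PRIMARY_SHARDS):
    """Group document ids by the shard they are routed to."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


docs = [f"doc-{n}" for n in range(20)]
for shard, ids in sorted(distribute(docs).items()):
    print(shard, ids)
```

Because the shard is derived from the id alone, any node can work out where a document lives without asking a central coordinator. The trade-off is that changing the shard count changes where existing documents map.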
Google has published a new approach to managing critical state in distributed systems. They call it Distributed Consensus, but it is essentially the old-is-new-again Paxos or a flavour thereof. I make a point of this because Sajari is the only search provider apart from Google that does this.
My conclusion here is that for battle-tested horizontal scaling you would want to avoid anything that isn't using shards. Remain cautious about how replicas are promoted to shard status, though: you might encounter the split-brain caveat, and that is not fun.
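The usual guard against split-brain is to require a strict majority (a quorum) of master-eligible nodes before a new master can be elected. The sketch below is a toy model of that rule, not Elasticsearch or Paxos code; the function names are hypothetical. In Elasticsearch releases of this era the corresponding setting is, as far as I know, `discovery.zen.minimum_master_nodes`.

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def sides_that_can_elect(partition_sizes, required_votes: int):
    """Sizes of the network partitions large enough to elect a master."""
    return [size for size in partition_sizes if size >= required_votes]


cluster_size = 3
partition = [2, 1]  # a network failure isolates one node from the other two

# With a majority quorum only the larger side can elect a master.
print(sides_that_can_elect(partition, majority(cluster_size)))

# With the quorum misconfigured as 1, both sides elect a master: split-brain.
print(sides_that_can_elect(partition, 1))
```

Since two disjoint groups cannot both hold more than half of the nodes, a majority quorum leaves at most one side able to elect a master.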
This leaves us with the choice between Elastic.co Cloud, Google Search, Sajari, and potentially AWS managed services due to their mammoth scale and reputation.
Probably the key point, so I've left it to last to really drive it home.
Whether you have source data that is structured or unstructured, let's assume your data has been ingested and we are ready to start searching across it. There are several considerations here and, just like scalability, the comparison is a bit too complex to dump in a table, so I'll dot point them now and expand upon them:
- Indexed structured search
- Wildcards, often called full-text search
- Fuzzy-search, or approximate string matching
- Faceted search
- Geo-search, using structured lat/lon data
- Boosting terms
Fuzzy search is like an automated wildcard search. Rather than wildcards, it uses hashes which produce the same hash result when pieces are re-hashed, which is best described as automatic tokenization of term variants. Elasticsearch does an excellent job in their documentation on dealing with human language of describing how plurals and variants of words can be used to query the same documents. So swim, swimmer, swimming, swims, and even related terms like pool or freestyle could return you the same single document.
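For the "approximate string matching" side of fuzzy search, a common building block is edit distance. The sketch below uses plain Levenshtein distance to accept terms within a couple of edits of the query. This is a technique I am swapping in for illustration and the names are hypothetical; the stemming and related-term matching described above is a separate step that an engine's analyzers handle, so edit distance alone would not connect swim to pool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, terms, max_edits: int = 2):
    """Return the terms within max_edits of the query, ignoring case."""
    return [t for t in terms if edit_distance(query.lower(), t.lower()) <= max_edits]


terms = ["swim", "swimmer", "swimming", "swims", "pool"]
print(fuzzy_match("swimm", terms))  # ['swim', 'swimmer', 'swims']
```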
Just as fuzzy search is just good old full-text searching, AI-powered is just a buzzword for auto-boosting terms. Boosting will effectively demote or promote results that match a given query while still returning said results. AI-powered search often uses Bayesian inference (or a variant, or another boosting algorithm) to find terms in plain text for you based on its own training datasets and continuous learning, making it superior (at least eventually) to tokenization boosting techniques alone.
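To make the "promote or demote but still return" behaviour of boosting concrete, here is a toy scorer. It is a sketch under my own assumptions (whitespace tokenization, term-frequency scoring, hypothetical `score` and `search` helpers), not how any of the products above rank results.

```python
def score(document: str, query_terms, boosts=None) -> float:
    """Sum term frequencies, multiplying each term by its boost (default 1.0)."""
    boosts = boosts or {}
    words = document.lower().split()
    return sum(words.count(term) * boosts.get(term, 1.0) for term in query_terms)


def search(documents, query_terms, boosts=None):
    """Return every matching document, ordered by boosted score."""
    matching = [d for d in documents
                if any(term in d.lower().split() for term in query_terms)]
    return sorted(matching, key=lambda d: score(d, query_terms, boosts), reverse=True)


documents = [
    "swimming pool opening hours",
    "pool table for sale",
    "olympic swimming results",
]
query = ["swimming", "pool"]

print(search(documents, query))
print(search(documents, query, boosts={"swimming": 3.0, "pool": 0.5}))
```

With the boosts applied, the pool table listing drops to the bottom but is still returned, which is the difference between boosting and filtering.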
Elasticsearch has made incredible advances above its rivals in terms of its out-of-the-box searching sugar, and I still hold its scan-and-scroll implementation of pagination in the highest regard, but without machine learning, I predict Elasticsearch will fail to be relevant.
AWS has two managed search options and neither offers modern searching functionality. They focus too heavily on service availability and very little on the functionality users actually expect, and I wouldn't hope for AWS to release a third search offering to bridge this gap either.
If you are looking for a modern, feature-rich search engine that not only provides a reliable service but is capable of creating its own corpus of content for you, then Swiftype or Sajari meet the mark. For my next project, I'll be choosing Sajari because it is additionally capable of ingesting content programmatically (and it is written in Go!).
If you're already heavily invested in the Google, Azure, or AWS cloud infrastructure, you're getting reliability by using their services, but at a cost of functionality that your own development team will ultimately be forced to build for you. Your business and users will expect a certain level of completeness now that there are solutions with more battle-tested and modern features. Anything less than awesome will reflect poorly on the developers' ability (poor devs) and ultimately make your own new product seem immature and dated.