Experienced SEOs know that to understand what Google is up to, you need to read and dissect the patents it files. Below is a summary of the important points in Google’s newly granted patent.
Before we dig into this latest Google patent, which was granted on December 22, 2016, let’s first define an entity to make sure we’re all on the same page. According to the patent, the definition is as follows:
[A]n entity is a thing or concept that is singular, unique, well-defined and distinguishable. For example, an entity may be a person, place, item, idea, abstract concept, concrete element, other suitable thing, or any combination thereof.
To keep things simple, you can casually think of an entity as a noun.
Another definition that will be important to understand is unstructured data, which Wikipedia defines pretty accurately as follows:
Unstructured data … refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
With that under our belt, we’re going to dive right into the patent. The way this article will be structured, I will be including the exact verbiage of important sections of the patent in italics, followed by an explanation of what each section means.
Methods, systems, and computer-readable media are provided for collective reconciliation. In some implementations, a query is received, wherein the query is associated at least in part with a type of entity. One or more search results are generated based at least in part on the query. Previously generated data is retrieved associated with at least one search result of the one or more of search results, the data comprising one or more entity references in the at least one search result corresponding to the type of entity. The one or more entity references are ranked, and an entity result is selected from the one or more entity references based at least in part on the ranking. An answer to the query is provided based at least in part on the entity result.
This is one of those abstracts that does little to describe the full scope of what’s contained in the patent. As far as the abstract is concerned, all we’re about to read is that entities get ranked, and that ranking determines the answer to a query.
This was enough to draw me into the patent, and it is indeed accurate — but as you’ll soon see, there’s a lot more described within than a simple “we rank nouns.”
The following excerpts are contained within the summary section of the patent.
[A] system provides answers to natural language search queries by relying on entity references identified based in the unstructured data associated with search results. … [T]he system retrieves additional, preprocessed information associated each respective webpage of at least some of the search results … the additional information includes, for example, names of people that appear in the webpages. In an example, in order to answer a “who” question, the system compiles names appearing in the first ten search results, as identified in the additional information. The system identifies the most commonly appearing name as the answer …
In the excerpt above, we start to see the method behind the system. What Google is discussing here is the idea that to determine the answer to a “who” question, they would use the most common name appearing across the top 10 search results.
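To make that idea concrete, here is a minimal sketch of "pick the most common name across the top results." It is an illustration only, not Google's implementation: the function name `most_common_name`, the sample data, and the choice to count every mention (rather than one vote per page) are all assumptions of mine.

```python
from collections import Counter

def most_common_name(names_per_result, top_n=10):
    """Return the name mentioned most often across the first top_n results.

    names_per_result is a list of lists: one list of person names per
    search result, assumed to have been extracted ahead of time.
    """
    counts = Counter()
    for names in names_per_result[:top_n]:
        counts.update(names)  # assumption: every mention counts once
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical, hand-made data standing in for three search results.
results = [
    ["Ada Lovelace", "Charles Babbage"],
    ["Ada Lovelace"],
    ["Charles Babbage", "Ada Lovelace", "Ada Lovelace"],
]
answer = most_common_name(results)
```

In this made-up data, "Ada Lovelace" is mentioned four times and "Charles Babbage" twice, so the first name would be returned as the answer to the "who" question.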
[T]he query is a natural language query … ranking the one or more entity references comprises ranking based on at least one ranking signal. In some implementations, the one or more ranking signals comprise a frequency of occurrence of each respective entity reference. In some implementations, the one or more ranking signals comprise a topicality score of each respective entity reference. In some implementations, the previously generated data corresponds to unstructured data.
This excerpt adds detail on the approach. One ranking signal is how frequently an entity reference appears within a document, and presumably across multiple documents. In addition, we see that topicality is a relevancy factor and that this method is applied to unstructured data.
[Q]uestions may be provided for queries in an automated and continuously updated fashion. In some implementations, question answering may take advantage of search result ranking techniques. In some implementations, question answers may be identified automatically based on unstructured content of a network such as the Internet.
In this section, we see it reinforced that the answers to questions may be determined using search results and ranking techniques. The patent also appears to expand to cover the automated determination of answers by other techniques that can find the answer in unstructured data.
The real meat of Patent US 2016/0371385 A1
Sections 14 through 96 give detailed descriptions of the images, flowcharts and the real meat included with this patent. Some of the images will be included below and some will simply be noted, depending on which will get across the information better.
[T]he system may retrieve entity references associated with the top ten search results. … the ranking and/or selecting is based on a quality score, a freshness score, a relevance score, on any other suitable information, or any combination thereof.
Here, we see Google clarifying that different types of entities and answers may be based on different sets of information. For example, freshness may be selected as a stronger signal if you were looking up the weather, whereas quality may be stronger if you were looking up a definition, health information and so on.
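One way to picture signals carrying different weight for different query types is a weighted sum. The sketch below is purely hypothetical: the patent names quality, freshness and relevance scores but gives no weights or formula, so the `WEIGHTS` table, the query types and the candidate data are assumptions chosen for illustration.

```python
# Hypothetical weights per query type; the patent does not publish any values.
WEIGHTS = {
    "weather": {"quality": 0.2, "freshness": 0.6, "relevance": 0.2},
    "definition": {"quality": 0.6, "freshness": 0.1, "relevance": 0.3},
}

def combined_score(signals, query_type):
    """Weighted sum of the signal scores for the given query type."""
    weights = WEIGHTS[query_type]
    return sum(weights[name] * signals[name] for name in weights)

def pick_entity(candidates, query_type):
    """Return the entity whose combined score is highest."""
    best = max(candidates, key=lambda c: combined_score(c["signals"], query_type))
    return best["entity"]

# Made-up candidates: one old but high quality, one fresh but thin.
candidates = [
    {"entity": "old but authoritative",
     "signals": {"quality": 0.9, "freshness": 0.1, "relevance": 0.8}},
    {"entity": "new but thin",
     "signals": {"quality": 0.4, "freshness": 0.9, "relevance": 0.8}},
]
```

With these assumed weights, the fresh candidate wins for a "weather" query and the authoritative one wins for a "definition" query, which mirrors the intuition described above.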
I’ll admit it, I had to read this section a couple of times to fully grasp what they were talking about. This section relates to patent Figure 1, which is as follows:
[T]he information retrieved from entity references 110 associated with a particular webpage is a list of persons appearing in that webpage. For example, a particular webpage may include a number of names of persons, and entity references 110 may include a list of the names included within the webpage. Entity references 110 may also include other information. In some implementations, entity references 110 includes entity references of different types, for example, people, places, and dates. In some implementations, entity references for multiple entity types are maintained as a single annotated list of entity references, as separated lists, in any other suitable format of information, or any combination thereof. It will be understood that in some implementations, entity references 110 and index 108 may be stored in a single index, in multiple indices, in any other suitable structure, or any combination thereof.
The idea behind what they’re referring to here is repeated elsewhere in the patent. One of the big issues that occurred to me while reading this patent is the enormous processing power it would take. If, for every entity search, the engine needed to run a query on its own index, process the top 10 results and then determine which terms are used most often in order to establish the most likely answer to a question, each search of this kind would take many times more resources.
In section 20, they discuss the method around this, which is to pre-populate reference lists (110 in the diagram) separate from the index itself.
So, when a query like “who is dave davies” is entered, the data is drawn from the index (to determine the possible pages that have the answers), but a second reference point (110) also exists that would contain the entity data (such as how many times “dave davies” is mentioned in each document), thus saving Google from needing to figure it out on the fly.
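The split between an offline step and a query-time step can be sketched as follows. This is a toy model under my own assumptions: `PAGES` stands in for the index (108), `ENTITY_REFERENCES` for the precomputed entity references (110), and simple substring counting stands in for whatever entity extraction Google actually uses.

```python
from collections import Counter

# Hypothetical toy stand-in for indexed pages.
PAGES = {
    "page-a": "dave davies writes about seo and dave davies speaks at seo events",
    "page-b": "ray davies and dave davies founded the kinks",
}
KNOWN_PEOPLE = ["dave davies", "ray davies"]

def build_entity_references(pages, known_people):
    """Offline step: count mentions of each known name in each page, once."""
    refs = {}
    for page_id, text in pages.items():
        refs[page_id] = Counter(
            {name: text.count(name) for name in known_people if name in text}
        )
    return refs

# Built ahead of time, separately from the index itself.
ENTITY_REFERENCES = build_entity_references(PAGES, KNOWN_PEOPLE)

def answer_who(result_page_ids):
    """Query-time step: sum the stored counts; no page text is re-read."""
    totals = Counter()
    for page_id in result_page_ids:
        totals.update(ENTITY_REFERENCES.get(page_id, {}))
    return totals.most_common(1)[0][0] if totals else None
```

The point of the design is that `answer_who` only does a lookup and a sum per result, while the expensive text processing happens once in `build_entity_references`.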
[O]ne or more ranking metrics are used to rank the entity references, including frequency of occurrence and a topicality score. Frequency of occurrence relates to the number of times an entity reference occurs within a particular document, collection of documents, or other content. Topicality scores include a relationship between the entity reference and the content in which it appears.
Setting aside the repeated mention of how many times a term is used as a metric, we also see topicality reinforced in this section. While this could relate to the relevance of a site to a subject and the weighting a reference should be given, I tend to believe it has more to do with aiding in understanding which entity is being referenced.
For example, if the entity “dave davies” is seen on a page related to SEO, then it is likely me. On the other hand, if “dave davies” appears on a page related to music, it’s likely “that Kinks guy” (as I like to refer to him).
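Here is a minimal sketch of that disambiguation idea, assuming topicality can be approximated by keyword overlap between the page and a small set of topic words per candidate entity. The patent does not say how topicality scores are computed, so the `ENTITY_TOPICS` table and the scoring formula are hypothetical.

```python
# Hypothetical topic words for two entities that share a name.
ENTITY_TOPICS = {
    "Dave Davies (SEO)": {"seo", "search", "google", "ranking"},
    "Dave Davies (The Kinks)": {"music", "guitar", "band", "kinks"},
}

def topicality(page_text, topic_words):
    """Share of an entity's topic words that appear on the page."""
    words = set(page_text.lower().split())
    return len(words & topic_words) / len(topic_words)

def disambiguate(page_text):
    """Pick the candidate entity whose topic words best match the page."""
    return max(
        ENTITY_TOPICS,
        key=lambda entity: topicality(page_text, ENTITY_TOPICS[entity]),
    )

seo_page = "dave davies on google ranking and seo"
music_page = "dave davies played guitar in the kinks band"
```

Under these assumptions, `disambiguate(seo_page)` resolves to the SEO entity and `disambiguate(music_page)` to the musician, because each page shares three of the four topic words with one candidate and none with the other.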
Reading patents is a lot to take in. Read over this one a couple of times to make sure you understand what Google is trying to do.