An article by Hal Hodson published in New Scientist has something of a seemingly hyperbolic title, “Google wants to rank websites based on facts not links.”
But the very first paragraph of the Google Research paper cited by Hodson shows the headline to be fairly, well, factual. This is a research proposal to replace links with factual accuracy as a means of assessing a web page or web site’s trustworthiness.
The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy.
I’ve taken an initial look at the paper, called “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”, and there’s much about it that’s both compelling and contentious; it also sometimes offers inadvertent insights into other Google projects and priorities.
I threw out my initial thoughts on Google+, and here I expand on and augment them.
Extraction errors and the Knowledge Vault
Perhaps unsurprisingly, it turns out that automating the fact extraction process has made it error-prone. They note that “extraction errors are far more prevalent than source errors. Ignoring this distinction can cause us to incorrectly distrust a website.”
Web resources without triples
Again unsurprisingly, it turns out that a paucity of data for a resource makes it difficult to assess that resource (suggesting that I might not have been entirely out on a limb in speaking of SEO with data).
What is somewhat surprising, it turns out, is just how many resources have so few extractable triples (and, interestingly, triples seem to be the measure of whether or not a given resource “has data” in the eyes of the Knowledge Vault).
This [assessment mechanism for automatically extracted facts] can cause problems when data are sparse. For example, for more than one billion webpages, KV is only able to extract a single triple (other extraction systems have similar limitations). This makes it difficult to reliably estimate the trustworthiness of such sources.
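To make the unit of measurement concrete, here’s a minimal sketch of the kind of (subject, predicate, object) triple being counted. The tuple format below is purely illustrative and not the Knowledge Vault’s internal representation; the example fact is borrowed from the paper’s own Obama example.

```python
# An illustrative (subject, predicate, object) triple -- the unit the
# Knowledge Vault appears to count when deciding whether a page "has data".
triple = ("Barack Obama", "nationality", "USA")

# A page yielding only one such triple gives very little evidence on which
# to estimate its trustworthiness -- the "single triple" problem quoted above.
triples_extracted_from_page = [triple]
print(len(triples_extracted_from_page))  # 1
```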
But apparently they’ve made progress in cracking both these nuts.
Our main contribution is a more sophisticated probabilistic model, which can distinguish between two main sources of error: incorrect facts on a page, and incorrect extractions made by an extraction system.
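To see why that distinction matters, here’s a toy back-of-the-envelope calculation. It is my own illustration, not the paper’s multi-layer model, and the probabilities are invented; it just shows how a naive model that blames every wrong extracted triple on the page will unfairly distrust it when the extractor is the more likely culprit.

```python
# Invented, illustrative probabilities -- not values from the paper.
p_extractor_correct = 0.80  # chance the extractor faithfully reads what the page says
p_page_correct = 0.95       # chance the page itself states the true fact

# An extracted triple is (roughly) true only if both the page and the
# extractor got it right.
p_triple_true = p_extractor_correct * p_page_correct
p_triple_wrong = 1 - p_triple_true

# Rough attribution of blame for a wrong extracted triple
# (ignoring the small chance that both are wrong at once):
p_blame_extractor = (1 - p_extractor_correct) * p_page_correct / p_triple_wrong
p_blame_page = p_extractor_correct * (1 - p_page_correct) / p_triple_wrong

print(f"P(extracted triple true)      ~= {p_triple_true:.2f}")       # ~0.76
print(f"P(extractor at fault | wrong) ~= {p_blame_extractor:.2f}")   # ~0.79
print(f"P(page at fault | wrong)      ~= {p_blame_page:.2f}")        # ~0.17
```

A model that pins all of that error on the page would conclude the site is wrong far more often than it really is, which is exactly the “incorrectly distrust a website” failure mode the authors describe.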
The main topic of a website, and other challenges
One of the identified areas for improvement is an improved ability to identify the main entity of a page.
To avoid evaluating KBT on topic irrelevant triples, we need to identify the main topics of a website, and filter triples whose entity or predicate is not relevant to these topics.
(On a side note, a mechanism identifying the main entity using schema.org has long been discussed, and recently proposed.)
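For what it’s worth, a page can already hint at its main entity using schema.org’s mainEntity property. The sketch below serializes a hypothetical example as JSON-LD from a Python dict; the specific values are mine, not anything proposed in the paper.

```python
import json

# Hedged sketch: a WebPage declaring its main entity via schema.org's
# mainEntity property, emitted as JSON-LD. Values are illustrative.
markup = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "mainEntity": {
        "@type": "Person",
        "name": "Barack Obama",
    },
}
print(json.dumps(markup, indent=2))
```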
On a related note, they say that in order to “avoid evaluating KBT on trivial extracted triples, we need to decide whether the information in a triple is trivial.”
All in all, it’s evident that “not enough data” (not enough triples) and “too much data” (too many triples) are persistent problems at opposite ends of the data volume spectrum. And even if the also-identified goal of improving extraction capabilities is achieved, the result cuts both ways: “more triples, they may introduce more noise.”
Whither the Knowledge Graph after Freebase?
I’ll just mention, as I did on Google+, that the Knowledge Graph doesn’t just rely on Freebase classes, but replicates them exactly.
We used the Google Knowledge Graph (KG) (whose schema, and hence set of classes is identical to that of Freebase) to map cell values to entities, and then to the classes in the KG to which they belong.
So I’ll note, tangentially, that it’ll be interesting to see how it all works out for Google once Freebase is shuttered and Wikidata becomes the Knowledge Graph’s new BFF. Will classes simply then be derived from Wikidata, as they seemingly were from Freebase?
What’s a fact, Jack?
The opening two sentences of the abstract are remarkable in terms of what follows, insofar as the first speaks of “factual information”, and the second of “facts”, without any subsequent discussion of what constitutes a “fact”.
The paper does, however, have this to say on assessing the correctness of facts. Emphasis mine.
We extract a plurality of facts from many pages using information extraction techniques. We then jointly estimate the correctness of these facts and the accuracy of the sources using inference in a probabilistic model. Inference is an iterative process, since we believe a source is accurate if its facts are correct, and we believe the facts are correct if they are extracted from an accurate source.
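The circularity described there lends itself to a fixed-point computation. Here’s a deliberately simplified sketch of the shape of that iteration, using plain weighted voting rather than the paper’s multi-layer probabilistic model; the sources and facts are made up.

```python
# Toy joint estimation: trust a source if its facts look correct, and
# believe a fact if trusted sources assert it. Not the paper's model.

# source -> {(subject, predicate): asserted object}  (illustrative data)
sources = {
    "site-a.example": {("Obama", "nationality"): "USA"},
    "site-b.example": {("Obama", "nationality"): "USA"},
    "site-c.example": {("Obama", "nationality"): "Kenya"},
}

trust = {s: 0.5 for s in sources}          # start with uniform trust
for _ in range(10):                        # iterate toward a fixed point
    # 1. Score each candidate value by the trust of the sources asserting it.
    votes = {}
    for s, facts in sources.items():
        for key, value in facts.items():
            votes.setdefault(key, {}).setdefault(value, 0.0)
            votes[key][value] += trust[s]
    # 2. Re-estimate each source's trust as the fraction of its assertions
    #    that match the currently best-supported value.
    for s, facts in sources.items():
        correct = sum(
            1 for key, value in facts.items()
            if value == max(votes[key], key=votes[key].get)
        )
        trust[s] = correct / len(facts)

print(trust)  # the outlier site ends up with low trust in this toy run
```

In this toy run the outlier site loses trust precisely because the better-trusted sources disagree with it; the paper’s actual model additionally reasons about extraction noise, which this sketch ignores.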
Later on, when the paper describes advances the researchers have made in probabilistic modeling, the emphasis is again on being better able to assess the source’s trustworthiness.
This provides a much more accurate estimate of the source reliability.
I hesitate to say much about the Open World Assumption, since I always seem to put my foot in it when I do, but it does seem to me to be worth mentioning in relation to the thrust of Knowledge-Based Trust.
The Open World Assumption, says Juan Sequeda, “is the assumption that what is not known to be true is simply unknown”, whereas the Closed World Assumption holds “that what is not known to be true must be false.”
He goes on to say:
Recall that OWA is applied in a system that has incomplete information. Guess what the Web is? The Web is a system with incomplete information. Absence of information on the web means that the information has not been made explicit. That is why the Semantic Web uses the OWA. The essence of the Semantic Web is the possibility to infer new information.
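For concreteness, here’s a tiny contrast of the two assumptions as code. This is my own illustration following Sequeda’s definitions above; the facts and function names are made up.

```python
# Toy knowledge base with a single known fact (illustrative).
known_facts = {("Obama", "citizenOf", "USA")}

def closed_world(query):
    # CWA: anything not known to be true is treated as false.
    return query in known_facts

def open_world(query):
    # OWA: anything not known to be true is simply unknown.
    return True if query in known_facts else "unknown"

q = ("Obama", "citizenOf", "Kenya")
print(closed_world(q))  # False   -- absence of the fact is taken as falsity
print(open_world(q))    # unknown -- absence just means it hasn't been stated
```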
Knowledge-Based Trust certainly avails itself of the power of inference. In fact, the essence of what the researchers propose is “a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model.”
But does the model’s reliance on “accurate sources” in order to determine whether or not something is a fact itself assume the a priori existence of factually correct sources, somewhat in contradiction of the Open World Assumption? Again, I’ve probably put my foot in it, but if the Open World Assumption isn’t the elephant in the room here, the principle still seems worth thinking about in regard to Knowledge-Based Trust, whether in its application, its disregard, or both.
Certainly the authors of the paper took the bull by the horns in more ways than one in using the citizenship of Barack Obama as an example, since by a preponderance of links and other signals one might incorrectly assess that he was a Kenyan citizen.
And what about irony, or parodies, or inadvertent ironic self-parody?

But in whatever detail the paper describes a method of assessing the trustworthiness of a resource by using facts, it fails to squarely address the issue of just what a “fact” is to begin with.
Bernard Vatant has commented that Google conveys “(maybe unwillingly) the (very naive) notion that the Knowledge Graph stands at a neat projection in data of ‘real-world’ well-defined things-entities-objects and proven (true) facts about those.” (See also ensuing discussions.)
He says this in regard to the Knowledge-Based Trust paper.
Seems to me this paper, regardless of the intrinsic scientific quality and interest of the method and experiments – which I must confess I have not enough understanding to assess thoroughly – presents the same fundamental confusion I have previously pointed at in Google Knowledge Graph’s presentation prose. There again the meaning of “facts” is not clearly defined, and taken for granted.
If such a terminological vagueness is already borderline in general communication and marketing of the Knowledge Graph concept, it’s far more difficult to admit in the context of a scientific publication. The term “fact” seems clearly used to denote “statement” typically expressed as (subject, predicate, object) triple (as in RDF). But expressions such as “the correct value for a fact (such as Barack Obama’s nationality)” or “facts extracted by automatic methods such as KV may be wrong” show indeed a very loosy use of the notion of “correctness” or “truth” applied to “facts” which should be used in a scientific publication context with much more caution.
I don’t feel qualified to comment on the appropriateness of the paper’s use of the word “fact” in the context of a scientific paper, but it seems evident even to this relative layman that talking about an approach that relies on “the correctness of factual information” without addressing just what constitutes “factual information” is a pretty glaring epistemological error of omission.
Knowledge-Based Trust: not the whole shooting match
The other thing to note about the first paragraph of the paper, cited above, is that it spoke of using links to evaluate the “quality of web sources” (emphasis mine).
So I don’t think the authors have suggested that links can’t be used for assessing web resources, only that KBT should replace them as a mechanism for assessing web source quality. Links might still, for example, be used for assessing relevancy, or freshness, or virality.
And it certainly doesn’t say that other factors shouldn’t be taken into consideration in the ranking of web resources in search results. Indeed the authors stress early that “source trustworthiness provides an additional signal for evaluating the quality of a website” and discuss “new research opportunities for improving it and using it in conjunction with existing signals such as PageRank.”
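In other words, KBT would be blended with other signals rather than replacing them outright. Purely as an illustration of that framing, here’s a sketch; the linear combination, the weights, and the signal names are my assumptions, not anything described in the paper.

```python
# Hypothetical blending of a trustworthiness score with other per-page
# signals. Weights and the linear form are illustrative assumptions.
def combined_score(kbt, pagerank, relevance, w_kbt=0.3, w_pr=0.3, w_rel=0.4):
    """Blend a KBT-style trustworthiness score with other ranking signals."""
    return w_kbt * kbt + w_pr * pagerank + w_rel * relevance

# e.g. a highly trustworthy but poorly linked page vs. the reverse
print(combined_score(kbt=0.9, pagerank=0.2, relevance=0.8))  # 0.65
print(combined_score(kbt=0.3, pagerank=0.9, relevance=0.8))  # 0.68
```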
So even if fully embraced by Google, Knowledge-Based Trust would never be the whole shooting match.
As the name “Knowledge-Based Trust” – described in the paper as a “trustworthiness score” – suggests, what the mechanism assesses is how much trust can be put in a resource, and there are many more types of assessment made than trustworthiness when a search engine provides a query response.
Not to say that trustworthiness might not in itself be a, or the, determining factor in what Google responds with for a query.
Certainly conventional SEO wisdom would have it that links are a, or the, determining factor in what Google responds with for a query, so if link equity were to be supplanted by factual accuracy – PageRank by KBT – then we’d have to regard Knowledge-Based Trust as a very influential signal indeed.
Another imperfect measure of trust, or a link-killer?
I can well imagine the response of many to this proposal: we can all see that Google sometimes gets its facts mixed up in response to a query, so it’s crazy to rank web resources based on factual accuracy when Google itself is factually inaccurate. Or that Google relies too much on a flawed Wikipedia for fact extraction and verification, and so skews in favor of the biases or shortcomings evident there.
These are valid points about the methodology, and of course just above I’ve raised objections to the glib assumption that what constitutes a fact is self-evident.
But the fact (ha) is that hyperlink-based resource evaluation doesn’t itself reflect some objective reality. It is error-prone, subject to gaming, and its own special brand of subjective.
Would relying on the seeming veracity of a web page or web site, rather than its seeming popularity, lead to better search results?
Obviously we wouldn’t know until we were able to compare those results. But for all the talk of how Google one day might no longer rely on links, this is a rare serious look at how Google might actually arrive at that future.