Google Scholar is Filled with Junk
The University Library of the University of Illinois describes Google Scholar as a “freely accessible search engine that lets users look for both physical and digital copies of articles”. They go on to describe some of the advantages of using Google Scholar but also suggest that there are better alternatives to Google Scholar and encourage students to closely evaluate their sources.
In November of 2014, Jeffrey Beall, a librarian at the University of Colorado, published “Google Scholar is Filled with Junk Science” on his blog, pointing out what he called a major flaw in Google Scholar’s comprehensive indexing strategy: the inclusion of predatory publishers. These are publishers who “perform a fake or non-existent peer review”.
A 2009 Library Journal article, “Google Scholar’s Ghost Authors”, criticizes Google Scholar for including ghosts in the machine, i.e., false author names that are artifacts of the Google Scholar parsing engine. The article includes a few examples: “N Subscriber”, short for “New Subscriber”, is cited as an author on Google Scholar. These examples no longer return valid results on Google Scholar, but that appears to be the result of a quick fix rather than a repair of the broader problem within the parsing engine.
In this article, we take a closer look at a problem within the Google Scholar parsing engine that has been lightly touched upon by the authors cited above but does not seem to be taken very seriously by the Google Scholar team. Now seems like a good time to look at it, since Google Scholar is no longer only indexing predatory publishers, or false journals that create empty articles to make themselves look like a journal; it is also being targeted by bad players in the porn industry, who seem to be getting pretty good at it.
The problem is that Google Scholar’s indexing policy is so comprehensive that it does not properly assess the source it is dealing with. Great approach if you want to just build a massive index, but not so great if you’d like to keep your index pristine and free from abuse.
If you’re a rogue player intent on scoring a free source of traffic, the challenge is to index your site on scholar.google.com. Unfortunately, as we will show today, it appears that more than just one or two players have already successfully undertaken this challenge.
The method to get indexed by Google Scholar appears to be trivial. If you have a basic understanding of how the original corpus was built, or how you yourself would undertake a project of this nature, then you shouldn’t have a problem engineering a solution here. But if you don’t, well here’s how others are doing it.
We start by looking at a legitimate source; in this example we will use the Association for the Advancement of Artificial Intelligence. The article on “Life in the Fast Lane – The Evolution of an Adaptive Vehicle Control System” is a great read, for sure, but for the purpose of this piece we’re more interested in the URL where it lives:
The path highlighted in red is constructed as a result of using Open Journal Systems (OJS), a journal management and publishing system developed by the Public Knowledge Project. If you take a look through their demo site, you will notice that the underlying link structure is the same. So now you know it, and so does the Google Scholar parsing engine, for it appears that this link structure is part of the criteria used when assessing if a site should be included.
In effect, the engine is asking “Is this link an article suitable for indexing?”, to which a module somewhere is replying “Well it follows the OJS link structure so if you can find attributes within the page that resemble the characteristics of an article, say a name and a title, then I’m happy to include this as an article”.
Trust appears to be delegated based solely upon the structure of the URL at which the article was discovered. The Google Scholar parsing engine presumably has a collection of regular expressions that capture the linking structures of systems the likes of OJS, and that is effectively their barrier to entry.
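To make the idea concrete, here is a minimal sketch of what such a URL check might look like. This is my guess at the shape of the filter, not Google Scholar's actual code: the pattern `OJS_ARTICLE_PATH`, the function `looks_like_ojs_article`, and the `example.org` hosts are all hypothetical, and the set of path verbs (`view`, `viewFile`, `download`) is an assumption based on the link structure discussed above.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for OJS-style article paths, e.g.
#   /index.php/<journal>/article/view/<id>
#   /index.php/<journal>/article/viewFile/<id>/<galley>
OJS_ARTICLE_PATH = re.compile(
    r"^/index\.php/[^/]+/article/(?:view|viewFile|download)/\d+(?:/\d+)?/?$"
)


def looks_like_ojs_article(url: str) -> bool:
    """Return True if the URL path follows the OJS article link structure.

    Note that the host is ignored entirely: only the path is inspected,
    which is exactly the weakness described in the text.
    """
    return OJS_ARTICLE_PATH.match(urlparse(url).path) is not None


if __name__ == "__main__":
    print(looks_like_ojs_article(
        "http://example.org/index.php/eee/article/viewFile/680/626"))
    print(looks_like_ojs_article("http://example.org/about"))
```

If the barrier to entry really is a check of this kind, then any site that serves its pages under an OJS-shaped path passes it, regardless of what the site actually is.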
It can’t be that simple, can it? You wouldn’t think so, but when you take a close look at some of the sites being indexed by Google Scholar, they are so clearly classed as hardcore pornography that the path of the URL itself is one of the very few attributes not in the hardcore porn class.
I asked Google Scholar to give me a list of all publications by the author B Boobs. Incidentally, this manifestation of ghost authors is precisely the problem highlighted by the Library Journal’s article in 2009.
The URL to the article “All Porn Tube Categories” is http://porn-style.com/index.php/eee/article/viewFile/680/626; note the link structure. It goes without saying that the site hosting the article in question, porn-style.com, is hardcore pornography.
I’ve included a simple data flow diagram illustrating how data from the hardcore porn site is parsed and then indexed by the Google Scholar parsing engine into the applicable fields of an article citation.
- The article title in this example appears to have been picked up because the text was all in uppercase. I’ve seen other examples where they pick up the text from the H1 HTML tag.
- Valid author names are the inner text of the hardcore porn categories, which are always elements containing more than one word. Note how many single word porn categories are omitted, but the two word categories right next to them are included.
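The two observations above can be sketched as a naive extractor. Everything here is an assumption on my part: the names `NaiveCitationParser` and `guess_citation` are hypothetical, the sample markup is a neutral placeholder, and preferring the H1 text over uppercase text is my guess at the priority. It only reproduces the behaviour I observed from the outside.

```python
from html.parser import HTMLParser


class NaiveCitationParser(HTMLParser):
    """Collect text nodes, noting which sit inside <a> or <h1> elements."""

    def __init__(self):
        super().__init__()
        self._stack = []       # currently open tags
        self.all_texts = []    # every non-empty text node
        self.link_texts = []   # text found inside <a>
        self.h1_texts = []     # text found inside <h1>

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to (and including) the matching open tag, if any.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        self.all_texts.append(text)
        if "a" in self._stack:
            self.link_texts.append(text)
        if "h1" in self._stack:
            self.h1_texts.append(text)


def guess_citation(html: str) -> dict:
    """Guess a title and author list the way the index appears to."""
    parser = NaiveCitationParser()
    parser.feed(html)
    uppercase = [t for t in parser.all_texts if t.isupper()]
    # Assumed priority: H1 text first, then the first all-uppercase text.
    title = (parser.h1_texts or uppercase or [None])[0]
    # "Authors" are link texts of more than one word; single words are dropped.
    authors = [t for t in parser.link_texts if len(t.split()) > 1]
    return {"title": title, "authors": authors}


if __name__ == "__main__":
    sample = (
        "<p>ALL VIDEO CATEGORIES</p>"
        "<ul><li><a href='/a'>Amateur</a></li>"
        "<li><a href='/b'>Big Budget</a></li>"
        "<li><a href='/c'>New Releases</a></li></ul>"
    )
    print(guess_citation(sample))
```

Nothing in a heuristic like this asks whether a two-word link text is actually a person's name, which would explain how category labels end up in the author field.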
And The Winners Are
SEO experts have not let this one go. If you spend some time perusing Google Scholar you will find that all types of junk have tainted their index. Not only is Google Scholar a great index for legitimate articles, but it’s also a great way to find porn. I wonder if Google Scholar enjoys any exemption from the content moderation policy of institutions the likes of schools. If it did, I’d imagine the administrators would not be too happy about the nature of some of the content indexed.
This isn’t just a little trick used by one or two porn sites. Even casual inspection turns up quite a few hardcore porn sites exploiting this. Here are the top 50 I found, sorted by their Alexa rank.
The Real Problem
The real problem is that when you’re dealing with a system responsible for indexing the collected peer-reviewed works of humankind, it’s probably a good idea to raise the barrier to entry and not depend on the structure of a URL to decide whether or not an “article” should be included. This is not an insurmountable problem. If you’re in the abuse field then you already have half a dozen mitigations on the tip of your tongue. So why have they not been implemented? After all, Google Scholar has been aware of this problem for years. Quoting directly from Beall’s earlier article: “Google scholar does not sufficiently screen for quality and includes much junk science. To remain relevant and valuable, Google Scholar needs to limit the database to articles from authentic and respected scholarly publications and exclude articles from known publishers of junk science”.
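As one illustration of raising the barrier, here is a minimal sketch of the kind of screening Beall describes: limit indexing to known publications and exclude known bad ones, with the URL structure as only a secondary signal. The host names, the `TRUSTED_HOSTS` and `BLOCKED_HOSTS` sets, and the `should_index` function are hypothetical, and a real system would need far more than a static list.

```python
from urllib.parse import urlparse

# Hypothetical lists; in practice these would be curated and maintained.
TRUSTED_HOSTS = {"journal.example.org"}
BLOCKED_HOSTS = {"junk.example.com"}


def should_index(url: str, path_looks_like_article: bool) -> bool:
    """Decide whether a URL may be indexed as a scholarly article.

    The URL structure alone is never sufficient: the host must also be
    a known publication and must not be on the exclusion list.
    """
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return False
    return host in TRUSTED_HOSTS and path_looks_like_article


if __name__ == "__main__":
    print(should_index(
        "https://journal.example.org/index.php/demo/article/view/42", True))
    print(should_index(
        "http://junk.example.com/index.php/eee/article/viewFile/680/626", True))
```

With a check like this, an OJS-shaped path on an unknown host is no longer enough to get into the index.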
Here’s what will happen next: somebody from Google will read this article and pass it on to a lass/chap in the Google Scholar team. Upon reviewing this article for themselves, and to avoid further embarrassment, they will come up with a quick fix to ensure that the examples and sites cited here no longer work (much like the 2009 article). They will then add a request to fix the more serious underlying issue of trust delegation and the low barrier to entry into the system. This may or may not be undertaken. If I had to put money down, I would bet on the latter, since in all likelihood this is a feature that has already been requested by a few others.
In any event, if they don’t address the underlying problem in a meaningful way, it won’t be long before Google Scholar is dealing with an issue far more serious than the inclusion of hardcore pornography in their index.