The Most Influential NLP Papers on Google Scholar

This article originally appeared on Lemmalytica — a blog about language, artificial intelligence, and coding.

Natural language processing (NLP) is a complex and evolving field. Part computer science, part linguistics, part statistics — it can be a challenge deciding where to begin. Books and online courses are a great place to start, and project-based learning is always a good idea, but at some point it becomes necessary to dig deeper, and that means looking at the academic literature.

Reading academic literature is an art unto itself, and just because a paper is popular doesn’t mean it’s the right place for a beginner. However, there is something to be said for papers that have both withstood the test of time and been widely accepted by experts. If a paper has been consistently cited in academic literature, then it’s probably fair to say that the paper is influential.

There are a variety of sources for finding academic papers online, but one of the best is Google Scholar (GS), which helpfully provides citation data. We’re going to use this as our measure of influence. Unfortunately, GS doesn’t provide an API or other easy way to programmatically access data, so we manually downloaded the first 1,000 search results for the term “natural language processing” and then parsed and analyzed the data.
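
For readers curious how the downloaded pages might be turned into a table, here is a minimal parsing sketch. It assumes the result pages were saved locally as HTML files, and the CSS class names (gs_ri, gs_rt, gs_a) and the “Cited by” link text are assumptions based on one snapshot of Google Scholar’s markup, so treat them as illustrative rather than stable.

    # Hypothetical sketch: parse locally saved Google Scholar result pages into a table.
    # The CSS class names below are assumptions about one snapshot of the GS markup.
    import glob
    import re
    import pandas as pd
    from bs4 import BeautifulSoup

    rows = []
    for path in glob.glob("gs_pages/*.html"):          # assumed location of the saved pages
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        for result in soup.select("div.gs_ri"):        # one block per search result
            title = result.select_one("h3.gs_rt").get_text(" ", strip=True)
            byline = result.select_one("div.gs_a").get_text(" ", strip=True)
            cited = result.find("a", string=re.compile(r"Cited by"))
            citations = int(cited.get_text().split()[-1]) if cited else 0
            rows.append({"title": title, "byline": byline, "citations": citations})

    papers = pd.DataFrame(rows)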

If you’re interested in the code for this project, or want to play with the data yourself, check out the companion Jupyter notebook on GitHub.

Data Exploration

Before we get started, let’s take a look at our data and see what we have to work with. In total, there are 973 papers in our dataset (after removing rows with missing data). For each row we have columns for the paper title, authors, blurb, citations, year, and link.
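
A minimal sketch of this step is below. The CSV file name is an assumption (the companion notebook may load the data differently), but the column names mirror the description above.

    import pandas as pd

    # Load the scraped Google Scholar results (the file name here is an assumption).
    df = pd.read_csv("nlp_papers.csv")

    # Drop rows with missing values, leaving the 973 complete records described above.
    df = df.dropna(subset=["title", "authors", "blurb", "citations", "year", "link"])

    print(len(df))                 # 973
    print(df.columns.tolist())     # ['title', 'authors', 'blurb', 'citations', 'year', 'link']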

We have a lot of information to work with here, but unfortunately we don’t have the full abstracts or the full paper text. That will have to wait for a future project. Still, we can do a lot with the citation data alone. But is citation data the right metric? Papers that were published a long time ago have an advantage because they have had more time to gather citations. Let’s add a citation_rate column to show how many citations a given paper has received per year since publication.
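
One simple way to define the rate is total citations divided by years since publication. The sketch below assumes 2020 as the reference year (the data was gathered then) and clamps the age to at least one year so that papers published in the current year don’t divide by zero; the companion notebook may define the rate slightly differently.

    # Citation rate: citations per year since publication.
    # 2020 is assumed as the reference year, since the data was gathered then.
    REFERENCE_YEAR = 2020

    age = (REFERENCE_YEAR - df["year"]).clip(lower=1)   # at least one year, so 2020 papers don't divide by zero
    df["citation_rate"] = df["citations"] / age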

Great! Now we have at least two metrics to use when judging how influential a particular paper has been in the world of NLP: total citations, and the citation rate since the year of publication.

NLP Research Over Time

Before we start exploring individual papers it would be useful to get a high-level view of our data. When were the most influential NLP papers produced? How has that trend changed over time? Let’s plot the production of NLP papers by year and see how things look.
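
A minimal plotting sketch, assuming the df loaded earlier and matplotlib:

    import matplotlib.pyplot as plt

    # Number of papers in the dataset published each year.
    papers_per_year = df.groupby("year").size()

    papers_per_year.plot(kind="bar", figsize=(12, 4))
    plt.xlabel("Year")
    plt.ylabel("Number of papers")
    plt.title("NLP papers per year (Google Scholar sample)")
    plt.tight_layout()
    plt.show()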

NLP papers are definitely proliferating over time! Our data only represents a small snapshot of all NLP papers, but even here we can see the number of papers per year trending upwards since the mid-1970s.

We need to be cautious here, though: the fact that more papers are being produced doesn’t necessarily tell us where the influential periods of production occur. It does tell us that NLP is growing in popularity, which is in itself an interesting trend. Perhaps we can get a better idea of the influence aspect by looking at citation counts and citation rates over time.
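
Something along these lines produces the citation views discussed next, again assuming the df and the citation_rate column from the earlier sketches:

    import matplotlib.pyplot as plt

    # Total citations and mean citation rate, grouped by publication year.
    by_year = df.groupby("year").agg(
        total_citations=("citations", "sum"),
        mean_citation_rate=("citation_rate", "mean"),
    )

    by_year.plot(subplots=True, figsize=(12, 6), marker="o")
    plt.tight_layout()
    plt.show()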

Number of citations per year and citation rates per year look reasonably steady, but there are some interesting outliers. What happened in 1999? It looks like that was a banner year for NLP. Perhaps we will find an answer as we continue to look at the data.

Influential Papers

Now that we have an idea of the broader trend in production of NLP papers, let’s get to our key question. What are the most influential papers and books? What should you read if you want to learn about NLP?

An obvious place to start is to look at which papers have the most citations. Generally, if a paper is widely cited in academic literature, we can reasonably say that it has been an influential paper. Let’s take a look at the top 10 most cited papers.
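
With the dataframe from earlier, that view is a simple sort (shown here as a sketch; column names are as described above):

    # Top 10 papers by total citation count.
    top_cited = (
        df.sort_values("citations", ascending=False)
          .head(10)[["title", "authors", "year", "citations"]]
    )
    print(top_cited.to_string(index=False))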

The clear leader in the citation count is Foundations of Statistical Natural Language Processing (FSNLP) by C Manning and H Schutze, which has 13,929 citations, more than double the next contender. FSNLP was published in 1999, which appears to explain the outlier we noticed in our high-level look at the data earlier. If we’re using citation count as our metric for influence, this data would imply that the following are the five most influential NLP papers:

  • Foundations of Statistical Natural Language Processing by C Manning and H Schutze, with 13,929 citations;
  • Natural Language Processing (almost) from Scratch by R Collobert, J Weston, L Bottou, and M Karlen, with 6,484 citations;
  • The Stanford CoreNLP Natural Language Processing Toolkit, by CD Manning, M Surdeanu, J Bauer, and JR Finkel, with 5,409 citations;
  • Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by S Bird, E Klein, and E Loper, with 5,304 citations; and,
  • A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, by R Collobert and J Weston, with 4,862 citations.

As we noted earlier though, older papers have an advantage over newer papers because they have had more time to generate citations. Let’s look at our other metric, the yearly citation rate, to see if we get different results.
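
The equivalent view for the rate metric, assuming the citation_rate column computed earlier:

    # Top 10 papers by citations per year since publication.
    top_rate = df.nlargest(10, "citation_rate")[["title", "authors", "year", "citation_rate"]]
    print(top_rate.to_string(index=False))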

Our top five papers by citation rate are almost the same as the top papers by total citations, with a slight re-ordering. But we have a new entrant to the mix, and indeed a new winner! Natural language processing by KR Chowdhary has the highest citation rate, with 863 citations per year. What’s more, the paper was published in 2020, which means the year isn’t even up!

Looking at the blurb doesn’t tell us much about why this paper has been so popular, but if we follow the link to the full abstract we see that the paper is actually a chapter from KR Chowdhary’s book Fundamentals of Artificial Intelligence. Perhaps this tells us something about the trend of NLP in general as we move from linguistic analysis to artificial intelligence applications. KR Chowdhary is a professor of computer science at Jodhpur Institute of Engineering & Technology, and based on our data, it would seem that he is one of the most influential figures in NLP and AI today.

Speaking of influential individuals, that seems like a good next step in our exploration.

Influential Authors

One of the great things about NLP is that it benefits from the expertise of all sorts of people, ranging from computer scientists, to linguists, to statisticians, and more.

To understand who the most influential authors are, let’s start by looking at who is the most prolific. While we’re at it, let’s see how many unique authors were involved in writing our 973 influential papers.
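
A rough sketch of that author-level reshaping is below. It assumes the authors column stores comma-separated name strings, which is how Google Scholar displays them; if the delimiter differs, the split would need adjusting.

    # One row per (paper, author); assumes author names are comma-separated strings.
    authors = (
        df.assign(author=df["authors"].str.split(","))
          .explode("author")
          .assign(author=lambda d: d["author"].str.strip())
    )

    print(authors["author"].nunique())                  # ~1,937 unique author names
    print(authors["author"].value_counts().head(10))    # most prolific authors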

It looks like 1,937 authors contributed to our 973 papers, which makes sense since most papers have multiple authors. The most prolific author, with 22 papers, is C Friedman. So is C Friedman the most influential author? Maybe the paper titles will give us a clue.
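
Using the exploded authors table from the previous sketch, pulling up her titles is a simple filter (the exact name string is assumed to match how it appears in the data):

    # Titles of papers that list C Friedman among the authors.
    friedman = authors[authors["author"] == "C Friedman"]
    print(friedman[["title", "year", "citations"]].to_string(index=False))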

It looks like most of C Friedman’s papers are related to medical issues and bioinformatics. Unfortunately, Google Scholar doesn’t tell us much about who the authors are, but a quick Internet search shows that C Friedman is Carol Friedman, a professor of biomedical informatics at Columbia University. And with so many papers to her name (several of which have hundreds of citations), she certainly looks like an influential author! If you’re interested in biomedical informatics, then she sounds like a good author to start with.

Of course, the quantity of papers isn’t the only indicator of influence. What happens if we look at a different measure? Rather than evaluating total papers, let’s see which author has the most citations.
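
Again working from the exploded authors table, a sketch of that aggregation:

    # Sum citations across all papers attributed to each author name.
    citations_by_author = (
        authors.groupby("author")["citations"]
               .sum()
               .sort_values(ascending=False)
    )
    print(citations_by_author.head(10))

Note that this simple approach credits each co-author with a paper’s full citation count, and it treats variant spellings of a name as different people, which is exactly the conundrum that comes up next.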

The author with the most citations is C Manning with a staggering 13,960 citations in our data. (You may recognize the name from our earlier look at most cited papers.) As we look at the top 10 authors by total citations, an interesting conundrum appears. C Manning is the top author, but another top author is CD Manning. Are they the same person? Let’s look at the papers.

Once again, we’re going to have to do a bit of research to determine whether C Manning and CD Manning are the same person. Now that we have the titles of the papers, a quick web search should reveal the information we need. And indeed, we can see that all four papers are co-authored by Professor Christopher D. Manning, who is a professor of machine learning, linguistics, and computer science at Stanford University and the Director of the Stanford Artificial Intelligence Laboratory. Prof. Manning is also the author of Introduction to Information Retrieval, which did not come up in our search for “natural language processing”, but is arguably his most influential text with 18,934 citations. Even though Prof. Manning only has four papers in our list (plus many more not on our list), his citation count is far and away the highest. If we’re looking for the most influential NLP expert, it would seem that Manning is a good candidate.

Of course, this raises the question of whether our list has other instances of authors that are listed under multiple different names. The answer is almost certainly yes. However, given that we have nearly 2,000 different authors, for now we’ll set that question aside rather than try to research all of them.

Learning from Experts

If the number of NLP papers being published each year is any indication, it would seem that NLP is a growing field. Here, we have only scratched the surface, both in terms of the number of papers we analyzed and the types of analysis we performed. However, if you’re interested in learning NLP, perhaps this data will provide you with a useful starting point. So, get reading, and let us know what you think!

Companion Notebook

For the companion Jupyter Notebook to this article, including all of the code and original data, see Lemmalytica Notebooks on GitHub.

For more NLP Articles and News…

Lemmalytica is a blog about language, artificial intelligence, and coding. For more articles, references, tutorials, and stories about natural language processing and related fields, check it out!

Translated from: https://medium.com/@severinperez/the-most-influential-nlp-papers-on-google-scholar-9df707f55259
