NLPExplorer

Exploring the Universe of NLP

Papers 55,565 Authors 39,555 Citations 723,976



Dataset




Sample Paper Record


{
    "_id": "D14-1162",
    "acl_id": "D14-1162",
    "paper_title": "Glove: Global Vectors for Word Representation",
    "ocr_title": "GloVe: Global Vectors for Word Representation",
    "publisher": "Association for Computational Linguistics",
    "sig_name": "SIGDAT",
    "venue_name": ["EMNLP"],
    "year": "2014",
    "month": "October",
    "address": "Doha, Qatar",
    "volume_name": "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    "pages": "1532–1543",
    "doi": "10.3115/v1/D14-1162",
    "acl_author_id": [
        "jeffrey-pennington",
        "richard-socher",
        "christopher-d-manning"
    ],
    "acl_author_name": [
        "Jeffrey Pennington",
        "Richard Socher",
        "Christopher Manning"
    ],
    "ocr_author": [
        {"email": "richard@socher.org", "name": "Richard Socher"},
        {"email": "manning@stanford.edu", "name": "Christopher D. Manning"},
        {"email": "jpennin@stanford.edu", "name": "Jeffrey Pennington"}
    ],
    "figure_caption": [
        "Figure 1: Weighting function f with = 3/4.",
        "Figure 2: Accuracy on the analogy task as function of vector size and window size/type. All models are trained on the 6 billion token corpus. In (a), the window size is 10. In (b) and (c), the vector size is 100."
    ],
    "table_caption": [
        "Table 1: Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific of steam.", 
        "Table 2: Results on the word analogy task, given as percent accuracy. Underlined scores are best within groups of similarly-sized models; bold scores are best overall. HPCA vectors are publicly available",
        ...
    ],
    "footnote": [
        "1 We could also include bias terms in Eqn. (16). ",
        "2 http://lebret.ch/words/ 3 http://code.google.com/p/word2vec/ 4 Levy et al. (2014) introduce a multiplicative analogy evaluation, 3COSMUL, and report an accuracy of 68.24% on ", 
        ...
    ],
    "url": [
        "http://lebret.ch/words/",
        "http://code.google.com/p/word2vec/"
    ],
    "domain": ["Lebret", "Google"],
    "ref_ACLcode": [
        "P12-1092",
        "E14-1051",
        ...
    ],
    "reference_cit2ref": [
        {
          "reference": "Tom M. Apostol. 1976. Introduction to Analytic Number Theory. Introduction to Analytic Num- ber Theory.",
          "cit2ref": "Apostol, 1976"
        },
        {
          "reference": "Marco Baroni, Georgiana Dinu, and Germ  an Kruszewski. 2014. Dont count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL.",
          "cit2ref": "Baroni et al. (2014)"
        },
        {
          "reference": "Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning.",
          "cit2ref": "Bengio, 2009"
        },
        ...
    ],
    "task": [
        "namedentityrecognition",
        "semanticsimilarity",
        "informationretrieval",
        "questionanswering",
        "textcategorization"
    ],
    "approach": [
        "semi-supervisedlearning",
        "unsupervisedlearning"
    ],
    "linguistic": [
        "embeddings"
    ],
    "dataset_type": [
        "news"
    ],
    "languages": [
        "english"
    ],
    "total_citations": 1412,
    "citation_distribution": {
        "2014": 1,
        "2015": 108,
        "2016": 262,
        "2017": 340,
        "2018": 700,
        "2019": 1
    }
}

Paper Record Attributes


_id: id of the paper. Same as acl_id.

acl_id: ACL Anthology id of the paper. The paper can be found in the ACL Anthology at https://aclweb.org/anthology/acl_id, with acl_id replaced by this value

paper_title: Title of the paper

ocr_title: Title of the paper as extracted from OCR++

publisher: Publisher information as provided by the ACL Anthology, which contains publications from both ACL and non-ACL events such as COLING and HLT

sig_name: Name of the Special Interest Group within the publisher

venue_name: Publication venue

year: Year of publication

month: Month of publication

address: Location of the venue, for example "Doha, Qatar"

volume_name: Name of the volume in which the paper was published

pages: Page range of the paper within the volume in which it was published

doi: Digital Object Identifier of the publication as registered at doi.org

acl_author_id: author_id in the ACL Anthology

acl_author_name: Author name in the ACL Anthology

ocr_author: Author email and name as extracted from the paper using OCR++

figure_caption: Captions of the figures in the paper extracted using OCR++

table_caption: Captions of the tables in the paper extracted using OCR++

footnote: Footnotes in the paper extracted using OCR++

url: List of URLs present in the paper

domain: List of main domains of the URLs in the paper

ref_ACLcode: acl_id of the papers that the current paper references

reference_cit2ref: List of dictionaries, each pairing a reference entry (reference) with the form in which it is cited in the text (cit2ref)

task: NLP tasks in this paper

approach: Approaches applied by the authors in this paper

linguistic: Linguistic target of study of this paper

dataset_type: Dataset used in this paper

languages: List of languages that the current paper deals with

total_citations: Citation count of the paper by publications in ACL Anthology

citation_distribution: Year-wise citation count of the paper by publications in ACL Anthology
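As a minimal sketch of how these attributes fit together, the Python snippet below loads a trimmed copy of the sample paper record and uses three of its fields. The helper names (anthology_url, citations_are_consistent, peak_citation_year) are hypothetical and not part of any NLPExplorer API; the URL pattern is the one given for acl_id above. Note that year and the keys of citation_distribution are stored as strings, so they need converting before numeric use.

```python
import json

# Trimmed copy of the sample paper record; only the fields used below.
record_json = """
{
    "_id": "D14-1162",
    "acl_id": "D14-1162",
    "year": "2014",
    "total_citations": 1412,
    "citation_distribution": {
        "2014": 1, "2015": 108, "2016": 262,
        "2017": 340, "2018": 700, "2019": 1
    }
}
"""


def anthology_url(record):
    """Build the ACL Anthology URL from acl_id (hypothetical helper)."""
    return "https://aclweb.org/anthology/" + record["acl_id"]


def citations_are_consistent(record):
    """Check that the year-wise counts add up to total_citations."""
    yearly = record["citation_distribution"].values()
    return sum(yearly) == record["total_citations"]


def peak_citation_year(record):
    """Return the year (as an int) with the highest citation count."""
    dist = record["citation_distribution"]
    return int(max(dist, key=dist.get))


paper = json.loads(record_json)
print(anthology_url(paper))             # https://aclweb.org/anthology/D14-1162
print(citations_are_consistent(paper))  # True
print(peak_citation_year(paper))        # 2018
```

For the sample record the six yearly counts sum to 1412, which matches total_citations.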




Sample Full Text Record


{
    "_id": "D14-1162",
    "year": "2014",
    "full_text": "Semantic vector space models of language represent ..."
}

Full Text Record Attributes

_id: Id of the paper

year: Year of publication

full_text: Full text of the paper
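Because full text is stored in separate records, it has to be matched to paper records through the shared _id. The following Python sketch shows one way to do that, assuming both collections have already been loaded as lists of dictionaries; attach_full_text is a hypothetical helper, and the second paper entry is included only to show what happens when no full text is available.

```python
def attach_full_text(papers, full_texts):
    """Return copies of the paper records with a full_text field added.

    Papers without a matching full-text record get full_text = None.
    """
    text_by_id = {ft["_id"]: ft["full_text"] for ft in full_texts}
    return [{**p, "full_text": text_by_id.get(p["_id"])} for p in papers]


# Illustrative inputs based on the sample records on this page.
papers = [
    {"_id": "D14-1162", "paper_title": "Glove: Global Vectors for Word Representation"},
    {"_id": "P12-1092"},  # no full-text record supplied in this sketch
]
full_texts = [
    {
        "_id": "D14-1162",
        "year": "2014",
        "full_text": "Semantic vector space models of language represent ...",
    }
]

merged = attach_full_text(papers, full_texts)
for record in merged:
    print(record["_id"], record["full_text"] is not None)
```

Building the lookup dictionary once keeps the join linear in the number of records, which matters when it is run over the whole collection.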




Sample Author Record


{
    "_id": "christopher-d-manning",
    "all_coauthors": [
        "dan-jurafsky",
        "daniel-car",
        ...
    ],
    "all_papers_written": [
        "W08-0304",
        "W09-3206",
        ...
    ],
    "first_paper": 1993,
    "latest_paper": 2019,
    "total_papers": 204,
    "paper_distribution": {
        "1993": 1,
        "1998": 1,
        "1999": 1,
        ...
    },
    "dataset_types": {
        "news": 82,
        "blogs": 10,
        "twitter": 4,
        ...
    },
    "languages": {
        "chinese": 46,
        "arabic": 21,
        ...
    },
    "linguistic": {
        "discourse": 32,
        "embeddings": 35,
        ...
    },
    "tasks": {
        "discourseparsing": 1,
        "informationextraction": 39,
        ...
    },
    "approach": {
        "generativemodel": 14,
        "deeplearning": 17,
        ...
    },
    "total_citations": 9585,
    "citation_distribution": {
        "1994": 4,
        "1996": 2,
        ...
    }
}

Author Record Attributes

_id: id of the author

all_coauthors: coauthors of the author

all_papers_written: List of acl_ids of the papers written by the author

first_paper: Year of first publication in ACL Anthology

latest_paper: Latest publication year in ACL Anthology

total_papers: Total papers published in ACL Anthology

paper_distribution: Year wise publication count in ACL Anthology

dataset_types: Datasets and corresponding publication count that the author has worked with

languages: Languages and the corresponding publication count that the author has worked on

linguistic: Linguistic target of study and the corresponding publication count

tasks: NLP tasks and the corresponding publication count that the author has worked on

approach: Approaches and the corresponding publication count that the author has worked with

total_citations: Citation count of the author by publications in ACL Anthology

citation_distribution: Year wise citation count of the author by publications in ACL Anthology
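Most author attributes are dictionaries that map a label to a publication count, so ranking them is a common first step. The Python sketch below does this with collections.Counter on a trimmed copy of the sample author record. The function names are hypothetical, and since the sample elides entries with "...", the rankings only reflect the values shown on this page.

```python
from collections import Counter

# Trimmed copy of the sample author record; elided entries are left out.
author = {
    "_id": "christopher-d-manning",
    "first_paper": 1993,
    "latest_paper": 2019,
    "total_papers": 204,
    "dataset_types": {"news": 82, "blogs": 10, "twitter": 4},
    "languages": {"chinese": 46, "arabic": 21},
}


def top_k(counts, k=2):
    """Return the k labels with the highest publication counts."""
    return Counter(counts).most_common(k)


def active_years(record):
    """Number of calendar years from first to latest paper, inclusive."""
    return record["latest_paper"] - record["first_paper"] + 1


print(top_k(author["dataset_types"]))  # [('news', 82), ('blogs', 10)]
print(top_k(author["languages"], 1))   # [('chinese', 46)]
print(active_years(author))            # 27
```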




Sample Conference Record


{'_id': 'EMNLP-2017',
 'total_citations': 1403,
 'total_papers': 352,
 'venue_name': 'EMNLP',
 'year': '2017',
 'approach': {'bayesianmodel': 12,
    'dataanalysis': 5,
    'deeplearning': 79,
    'discriminativemodel': 12,
    'generativemodel': 26,
    'graphicalmodel': 23,
    'humancomputation': 1,
    'kernelmethod': 7,
    'multilingualresources': 1,
    'representationlearning': 29,
    'semi-supervisedlearning': 22,
    'structuredprediction': 20,
    'topicmodeling': 18,
    'unsupervisedlearning': 14},
 'dataset_type': {'biographies': 2,
    'biomedicaltexts': 3,
    'blogs': 26,
    'chat': 20,
    'childlanguage': 1,
    'encyclopedia': 7,
    'legaldocuments': 2,
    'literarytext': 3,
    'news': 136,
    'querylogs': 4,
    'scientificliterature': 1,
    'socialmedia': 53,
    'spokendialog': 11,
    'twitter': 44,
    'webcrawl': 2},
 'language': {'arabic': 13,
    'childlanguage': 1,
    'chinese': 76,
    'english': 204,
    'french': 44,
    'hebrew': 3,
    'hindi': 9,
    'japanese': 31,
    'korean': 7,
    'low-resourcelanguages': 5,
    'malay': 4,
    'multilingual': 52,
    'semitic': 4,
    'spanish': 40,
    'tamil': 2},
 'linguistic': {'codemixing': 1,
    'codeswitching': 2,
    'cognitivelinguistics': 2,
    'discourse': 74,
    'distributionalsemantics': 17,
    'embeddings': 208,
    'eventsemantics': 2,
    'formalsemantics': 3,
    'gesture': 1,
    'groundedsemantics': 1,
    'languagechange': 2,
    'lexicalsemantics': 9,
    'morphology': 19,
    'multilingualism': 1,
    'neurolinguistics': 1,
    'ontologies': 6,
    'phonetics': 4,
    'phonology': 7,
    'pragmatics': 4,
    'prosody': 3,
    'psycholinguistics': 3,
    'syntax': 66,
    'typology': 8},
 'task': {'argumentationmining': 7,
    'asr': 17,
    'biomedical': 13,
    'chunking': 16,
    'coreferenceresolution': 21,
    'corpusannotation': 2,
    'discourseparsing': 10,
    'ethics': 1,
    'eventdetection': 6,
    'imagedescriptiongeneration': 3,
    'informationextraction': 40,
    'informationretrieval': 40,
    'knowledgeacquisition': 3,
    'languagegeneration': 22,
    'languageidentification': 5,
    'languageunderstanding': 31,
    'machinetranslation': 137,
    'mathematicalmodels': 1,
    'morphologicalanalysis': 8,
    'namedentityrecognition': 30,
    'nativelanguageidentification': 1,
    'ocr': 27,
    'paraphrasing': 18,
    'questionanswering': 63,
    'relationextraction': 31,
    'semanticparsing': 35,
    'semanticsimilarity': 39,
    'sentimentanalysis': 54,
    'socialscience': 8,
    'spellingcorrection': 3,
    'spokenlanguageprocessing': 1,
    'styleanalysis': 2,
    'summarization': 52,
    'syntacticparsing': 13,
    'tagging': 91,
    'textcategorization': 4,
    'textorganization': 4,
    'textsimplification': 5,
    'textualentailment': 20,
    'videodescriptiongeneration': 1,
    'wordsegmentation': 18,
    'wordsensedisambiguation': 10},
 'reference_yearwise_conf': {'*SEMEVAL-2001': 2,
    '*SEMEVAL-2004': 7,
    '*SEMEVAL-2007': 7,
    '*SEMEVAL-2010': 6,
    '*SEMEVAL-2012': 6,
    '*SEMEVAL-2013': 17,
    '*SEMEVAL-2014': 11,
    '*SEMEVAL-2015': 15,
    '*SEMEVAL-2016': 20,
    '*SEMEVAL-2017': 5,
    'ACL-1986': 1,
    'ACL-1989': 4,
    'ACL-1996': 3,
    'ACL-1997': 1,
    ...
    'WS-2012': 31,
    'WS-2013': 54,
    'WS-2014': 71,
    'WS-2015': 55,
    'WS-2016': 108,
    'WS-2017': 22},
 'citation_yearwise_conf': {'*SEMEVAL-2018': 42,
    'ACL-2017': 2,
    'ACL-2018': 314,
    'ALTA-2017': 1,
    'ALTA-2018': 4,
    'AMTA-2018': 5,
    'BEA-2018': 7,
    'CL-2018': 3,
    'CL-2019': 1,
    'CLPsych-2018': 5,
    'COLING-2018': 147,
    'CRAC-2018': 3,
    ...
    'TRAC-2018': 1,
    'TextGraphs-2018': 2,
    'WASSA-2017': 1,
    'WAT-2017': 2,
    'WMT-2017': 4,
    'WMT-2018': 39,
    'WS-2017': 17,
    'WS-2018': 266,
    'WS-2019': 2},
}

Conference Record Attributes

_id: id of the conference

total_citations: Citation Count of the Conference by publications in ACL Anthology

total_papers: Total papers published in the Conference

venue_name: Name of Conference

year: Year of Conference

approach: Approaches and the corresponding count of papers in the conference that use them

dataset_type: Dataset types and the corresponding count of papers in the conference that use them

language: Languages and the corresponding count of papers in the conference that deal with them

linguistic: Linguistic target of study and the corresponding publication count

task: NLP tasks and the corresponding count of papers in the conference that address them

reference_yearwise_conf: Conferences referenced by papers in this conference, along with the count of papers referenced from each of them

citation_yearwise_conf: Conferences that cite this conference, along with the count of papers from each of them that cite it
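The keys of reference_yearwise_conf and citation_yearwise_conf follow the same venue-year pattern as the conference _id (for example 'EMNLP-2017'). The Python sketch below splits such keys and totals the counts per year, using a subset of the sample citation_yearwise_conf values. It assumes every key ends in a hyphen followed by a four-digit year, as in the sample; split_conf_id and counts_by_year are hypothetical helper names.

```python
from collections import defaultdict

# Subset of the sample citation_yearwise_conf entries shown above.
citation_yearwise_conf = {
    "*SEMEVAL-2018": 42,
    "ACL-2017": 2,
    "ACL-2018": 314,
    "COLING-2018": 147,
    "WS-2017": 17,
    "WS-2018": 266,
    "WS-2019": 2,
}


def split_conf_id(conf_id):
    """Split 'VENUE-YEAR' on the last hyphen into (venue, year as int)."""
    venue, year = conf_id.rsplit("-", 1)
    return venue, int(year)


def counts_by_year(yearwise_conf):
    """Total the paper counts of all conferences held in the same year."""
    totals = defaultdict(int)
    for conf_id, count in yearwise_conf.items():
        _, year = split_conf_id(conf_id)
        totals[year] += count
    return dict(totals)


print(split_conf_id("EMNLP-2017"))             # ('EMNLP', 2017)
print(counts_by_year(citation_yearwise_conf))  # {2018: 769, 2017: 19, 2019: 2}
```

Splitting on the last hyphen rather than the first keeps venue names that themselves contain a hyphen intact.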
