Applicant's Wikimedia username. If one is not provided, then the applicant's name will be provided for community review.
- Tirthankar Ghosal
- Textual Novelty Detection in Wikipedia
Entity Receiving Funds
Provide the name of the individual or organization that would receive the funds.
- Tirthankar Ghosal (as an individual)
Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.
- Novelty detection is a frontier AI problem that aims to imitate the human cognitive ability to identify and absorb new knowledge. It aligns with Wikimedia’s overall vision of building the ‘sum of all knowledge’: we view that ‘sum’ as a novelty update over the existing knowledge on a given topic. In this project, we would aim to determine whether a document contains new knowledge relative to previously known, relevant knowledge. The investigation would focus on text as the source of knowledge for a given topic (although we acknowledge that the novelty of a topic is not confined to text and spans other dimensions of knowledge). The overall goal would be to produce a novelty score for a given document and to identify the sections of ‘new knowledge’ within it. Setting a threshold for newness is not straightforward, since the interpretation of ‘novelty’ is quite subjective. We would leverage recent natural language processing and deep learning techniques (memory networks in particular, in conjunction with representation learning via large language models) for our research. A sub-problem we would need to address first is ascertaining the relevant premise, or source knowledge base, against which the ‘newness’ of an incoming target document is to be determined. Finally, we would build a knowledge graph from the novel statement triples our deep models extract from topically relevant documents. An incoming document would first be classified by topic, its novel statements identified, and then integrated into the topical knowledge graph. The novelty score of the target document would be determined with respect to the source knowledge encoded in the topical knowledge graph.
Further, we would use the different topical novelty knowledge graphs to perform inter-graph traversal and uncover hidden links between concepts (nodes) whose association was not otherwise apparent, thereby facilitating novel association discovery between concepts. We would also build on several prior studies that have used Wikipedia articles to construct knowledge graphs. To the best of our knowledge, there is little research on document-level textual novelty detection in the literature, and incorporating knowledge graphs to detect textual novelty would itself be a ‘novel’ investigation. Can we imagine a ‘novelty score’ for a given Wikipedia article that highlights its zones of novel information?
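The core scoring step described above can be illustrated with a minimal sketch: each sentence of a target document is compared against the premise (source) sentences, sentences with low maximum similarity are flagged as novel, and the document-level score aggregates the per-sentence novelty. The `novelty` function, the 0.5 threshold, and the bag-of-words cosine similarity are all illustrative assumptions; our actual models would use learned representations (memory networks and large-language-model embeddings) and knowledge-graph lookups in place of this simple lexical overlap.

```python
# Illustrative sketch of document-level novelty scoring (not the final method):
# target sentences are compared against source (premise) sentences; sentences
# whose best similarity to any source sentence falls below a threshold are
# flagged as carrying new information.
import math
from collections import Counter


def _vec(text):
    """Bag-of-words term-frequency vector (stand-in for learned embeddings)."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return num / den if den else 0.0


def novelty(target_sentences, source_sentences, threshold=0.5):
    """Return (document_novelty_score, list_of_novel_sentences)."""
    src_vecs = [_vec(s) for s in source_sentences]
    per_sentence, novel = [], []
    for sent in target_sentences:
        # Best match against the source knowledge acts as the premise check.
        sim = max((_cosine(_vec(sent), v) for v in src_vecs), default=0.0)
        per_sentence.append(1.0 - sim)
        if sim < threshold:
            novel.append(sent)
    score = sum(per_sentence) / len(per_sentence) if per_sentence else 0.0
    return score, novel
```

A sentence already covered by the premise (e.g. one that repeats a source sentence verbatim) would contribute a per-sentence novelty near zero, while an unseen statement would be flagged and raise the document score.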
Approximate amount requested in USD.
Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).
- Tentatively, the associated costs would be:
- 1. Two Research Interns (from India) - 14000 USD
- 2. P.I. Remuneration - 12000 USD (1000 USD per month)
- 3. AWS Costs - 6000 USD
- 4. Annotator Costs - 8000 USD
- 5. Hardware and Software Costs - 3000 USD
- 6. Travel Costs - 7000 USD (conference travel or journal APCs)
Personnel and annotators would mostly be hired from India.
Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.
- As indicated earlier, the proposed project to detect newness in text aligns with Wikimedia’s 2030 vision of capturing the ‘sum of all knowledge’. To update that sum, incoming knowledge must be judged on whether it offers ‘new’ or ‘redundant’ information. With many different users updating the wiki, semantic duplicates are likely to be common. Our investigation would attempt to detect semantically redundant information and, in the process, identify new information as well. Using NLP/ML/KG methods, we envisage that each new document in Wikipedia could be associated with a newness score. We would also point users to the source premise documents, so as to make the score explainable and justifiable.
Plans for dissemination.
- All research carried out and data built as part of this work would be open-sourced for the community via public GitHub projects. We would document the progress (timelines, milestones, minutes, etc.) in a public wiki project. We would publish the research output in top-tier open-access AI/NLP/IR/Semantic Web conferences, including AAAI, IJCAI, ACL, NAACL, SIGIR, EMNLP, CIKM, WWW, ISWC, ESWC, etc. We would also showcase our research at Wikimedia events.
Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.
- 1. Ghosal, T., Edithal, V., Ekbal, A., Bhattacharyya, P., Tsatsaronis, G., & Chivukula, S. S. S. K. (2018, August). Novelty goes deep. A deep neural solution to document-level novelty detection. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2802-2813).
- 2. Ghosal, T., Edithal, V., Ekbal, A., Bhattacharyya, P., Chivukula, S. S. S. K., & Tsatsaronis, G. (2021). Is your document novel? Let attention guide you. An attention-based model for document-level novelty detection. Natural Language Engineering, 27(4), 427-454.
- 3. Ghosal, T., Saikh, T., Biswas, T., Ekbal, A., & Bhattacharyya, P. (2021). Novelty Detection: A Perspective from Natural Language Processing. Computational Linguistics, 1-42.
I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, and the application itself along with all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of my research actionable and reusable by the Wikimedia volunteer communities, affiliates, and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.