Grants:Project/Automatic Extraction of Multi-lingual Text and Concept Similarity

From Meta, a Wikimedia project coordination wiki
status: not selected
Automatic Extraction of Multi-lingual Text and Concept Similarity
summary: We propose to develop a similarity score between text sections in different languages, based on a combination of content, activity and graph structure. This method will be used to A) detect "similar concepts" in different languages, B) detect parallel sections of text in different languages, and C) detect missing sections in articles in a given language.
target: Comparison between Wikipedias in different languages.
type of grant: research
amount: 20,000 USD
type of applicant: group
contact: lev.muchnik@huji.ac.il
created on: 11:43, 1 March 2017 (UTC)

Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.
We aim to help Wikipedia editors and users find and compare concepts across languages when no direct link between the language versions exists, or when the content of a wiki page is incomplete or differs significantly from the corresponding content in other languages.

What is your solution?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.
We will use a combination of text analysis, graph measures and machine learning techniques to infer similarity metrics between texts in different languages. The method will be based on three comparisons: the Word2Vec content representations in the different languages, the structure of the hyperlink network in the ego-network surrounding a text section, and the page-view time series. Using a combination of these similarity scores, we will find the most similar candidate for a text section in any requested language, and compare the sections of the same text across languages.
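As a rough illustration of how the three signals could be combined, the sketch below computes a cosine similarity between two section embeddings, a Jaccard overlap between ego-network neighbor sets, and a Pearson correlation between page-view series, then takes a weighted average. It assumes that the embeddings already live in a shared cross-lingual space and that neighbors have been mapped to language-independent identifiers. The function names, the weights and the toy data are all hypothetical, and in the proposed work the combination would presumably be learned rather than fixed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Fraction of shared neighbors between two ego-networks."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length page-view series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def combined_similarity(src, cand, weights=(0.5, 0.3, 0.2)):
    """Weighted average of content, graph and activity similarity.

    The weights are placeholders (an assumption of this sketch).
    """
    scores = (
        cosine(src["embedding"], cand["embedding"]),
        jaccard(src["neighbors"], cand["neighbors"]),
        pearson(src["views"], cand["views"]),
    )
    return sum(w * s for w, s in zip(weights, scores))


def best_candidate(src, candidates):
    """Return the title of the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_similarity(src, c))["title"]


# Toy data: neighbor IDs stand in for language-independent identifiers.
source = {"title": "Moon", "embedding": [0.9, 0.1, 0.0],
          "neighbors": {"n1", "n2", "n3"}, "views": [10, 30, 20, 50]}
candidates = [
    {"title": "Lune", "embedding": [0.8, 0.2, 0.1],
     "neighbors": {"n1", "n2", "n4"}, "views": [12, 28, 25, 55]},
    {"title": "Soleil", "embedding": [0.1, 0.9, 0.2],
     "neighbors": {"n3", "n4"}, "views": [40, 10, 35, 5]},
]
print(best_candidate(source, candidates))
```

Keeping each component score as a separate function means any one of them can be replaced (for example by a different embedding model) without touching the ranking step.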

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.
We will use a combination of measures and machine learning techniques to infer a similarity metric between texts in different languages. This will be useful for:

  A. Suggesting the concepts most similar to a known concept in a given language.
  B. Suggesting missing translations of Wikipedia entries (i.e. missing articles), or of sections within an entry, in different languages.
  C. Detecting erroneous links between concepts in different languages.
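Goal B can be pictured as a matching problem: given a matrix of similarity scores between the sections of the same article in two languages, sections of the source article whose best match falls below a threshold are reported as candidates for translation. The sketch below is a minimal illustration of that idea only. The section titles, the scores and the 0.5 threshold are invented for the example, and the scores are assumed to come from whatever similarity measure the project ends up using.

```python
def missing_sections(source_sections, target_sections, similarity, threshold=0.5):
    """Return source sections with no sufficiently similar target section.

    similarity[i][j] is the score between source_sections[i] and
    target_sections[j]; higher means more similar.
    """
    missing = []
    for i, title in enumerate(source_sections):
        # An empty row (no target sections at all) counts as "no match".
        best = max(similarity[i], default=0.0)
        if best < threshold:
            missing.append(title)
    return missing


en_sections = ["History", "Geography", "Economy"]
he_sections = ["History (he)", "Geography (he)"]
scores = [
    [0.91, 0.12],  # History
    [0.08, 0.84],  # Geography
    [0.15, 0.22],  # Economy: no good match in the target article
]
print(missing_sections(en_sections, he_sections, scores))
```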

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

The output for content similarity will be a Markov model matrix giving the probability of observing a given Word2Vec representation in one language as a function of a Word2Vec representation in another language. Graph similarity will be available through an app that builds the ego-graph of a concept in a given language and provides links to all concepts in a different language that share a sufficiently high fraction of neighbors in the ego-graph. Page-view similarity will be provided through an app that reports the correlation between the total page views of pairs of Wikipedia pages. The end product will be an app that receives a Wikipedia page and a target language and, if a parallel Wikipedia page exists, compares the sections of the page in the two languages; otherwise, it will show the most closely related pages in the target language. When the two language versions are highly dissimilar, the app will raise a flag. All of the elements above will be freely available to Wikimedia editors and users through an open website. The algorithm, tools and datasets created over the course of the project will be disseminated via scientific publications and made available to the public.
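One possible reading of the Markov model matrix described above is a row-stochastic matrix estimated from sections that are already known to correspond across languages. The sketch below assumes that the Word2Vec representations have first been discretized into cluster labels (an assumption of this example, not something the proposal specifies) and estimates P(target cluster | source cluster) by counting aligned pairs and normalizing each row. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_matrix(aligned_pairs):
    """Estimate P(target_cluster | source_cluster) from aligned label pairs.

    aligned_pairs: iterable of (source_cluster, target_cluster) tuples, one
    per pair of sections already known to correspond across languages.
    Returns a nested dict: matrix[source][target] -> probability.
    """
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1

    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        # Normalize each row so that its probabilities sum to 1.
        matrix[src] = {tgt: n / total for tgt, n in row.items()}
    return matrix


# Toy cluster labels for an English/Hebrew pair (illustrative only).
pairs = [("en:c1", "he:c7"), ("en:c1", "he:c7"), ("en:c1", "he:c2"),
         ("en:c4", "he:c2")]
matrix = estimate_transition_matrix(pairs)
for src, row in sorted(matrix.items()):
    print(src, dict(sorted(row.items())))
```

A low probability for an observed pair of representations is one signal that could feed the dissimilarity flag mentioned above.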

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.
We believe that the availability of such a tool will help close the gap in content quality across Wikipedias in different languages, improve content neutrality by reducing omission bias, and help coordinate editorial activity. Our current estimates suggest that hundreds of thousands of concepts covered in some languages will be identified and suggested for introduction into other languages. In addition, we hope that this project will help multilingual Wikipedia users access richer information (e.g. by complementing reading in one language with content from another).

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow up with people who are involved with your project?
This project will be executed by data science research groups at two universities. Each group will employ a full-time Ph.D. student supervised by the PIs. We will leverage our expertise in analyzing Wikipedia content in different languages and Wikipedia editing history, as well as our Wikishark.com platform, which offers methods for performing computations on Wikipedia page views. The first 9 months of the project will be devoted to developing the analytical methods the project requires. The last 3 months will focus on implementing the web API and the website that will make the tool available to the public.

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

  • Ph.D. student scholarships (20,000 USD)

Community engagement

Community input and participation help make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?
We are in constant communication with Wikimedia Israel and will test the app with them. We also plan to publish the methods in a scientific journal to maximize impact. Finally, the algorithm, tools and datasets created over the course of the project will be made available to the public.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
Yoram Louzoun (yoraml) and Lev Muchnik (LevMuchnik at gmail com) have been studying Wikipedia for several years, with their first research on it published in 2007. Both lead large research groups with experience in machine learning, text mining and graph analysis. They now plan to use this expertise to propose new tools and methods for Wikipedia users. This is the first Wikipedia research project proposed by the group, but its members have extensive experience in collaboration with other institutions, analysis of large datasets, and development of machine learning-based applications. The Ph.D. students involved in this analysis are experienced in Big Data and in building web APIs and websites that offer analytical tools to the public (e.g. wikishark.com).

List of relevant publications:

  • L. Muchnik, R. Itzhack, S. Solomon, and Y. Louzoun, “Self-emergence of knowledge trees: Extraction of the Wikipedia hierarchies,” Physical Review E, vol. 76, no. 1, p. 016106, 2007.
  • M. Kämpf, S. Tismer, J. W. Kantelhardt, and L. Muchnik, “Fluctuations in Wikipedia access-rate and edit-event data,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 23, pp. 6101–6111, Jul. 2012.
  • M. Kämpf, J. W. Kantelhardt, and L. Muchnik, “From Time Series to Co-Evolving Functional Networks: Dynamics of the Complex System ‘Wikipedia,’” in ECCS 2012, 2012.
  • L. Muchnik, S. Pei, L. C. Parra, S. D. S. Reis, J. S. Andrade, S. Havlin, and H. A. Makse, “Origins of power-law degree distribution in the heterogeneity of human activity in social networks,” Scientific Reports, vol. 3, p. 23, Apr. 2013.
  • H. Brot, L. Muchnik, and Y. Louzoun, “Directed triadic closure and edge deletion mechanism induce asymmetry in directed edge properties,” The European Physical Journal B, vol. 88, no. 1, p. 12, Jan. 2015.
  • H. Brot, L. Muchnik, J. Goldenberg, and Y. Louzoun, “Evolution through bursts: Network structure develops through localized bursts in time and space,” Network Science, vol. 4, no. 3, pp. 293–313, Sep. 2016.

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc.

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).