Research:History and cultural heritage of the Canary Islands in the Wikimedia projects/Data corpus

From Meta, a Wikimedia project coordination wiki

This page describes the data models of the CSV files in which the project curates the data it works with.


  1. corpus_eswiki_articulos.csv.
  2. corpus_eswiki_wikidata_commons_articulos.csv.

The extracted data is split into two corpora. The first (1) is the curated list of all the Spanish Wikipedia articles chosen for analysis. The second (2) curates different aspects of each article. Each aspect is a column in the CSV; the columns of the corpus are:

  • id. A unique identifier for each item.
  • articulo. The name of the article.
  • tamano_bytes. Size of the article in bytes (including the wiki markup).
  • tamano_palabras. Size of the article in words.
  • fecha_creacion. Creation date.
  • id_creacion. Identifier of the first version of the article.
  • url_creacion. URL of the first version of the article.
  • fecha_ultima_revision. Last revision date.
  • id_ultima_revision. Identifier of the last version of the article.
  • url_ultima_revision. URL of the last version of the article.
  • editores_anonimos. Quantity of anonymous editors.
  • editores_registrados. Quantity of registered editors.
  • editor_principal. Main editor (the one who has made the most edits).
  • discusion. URL of the talk page if there is one.
  • discusion_tamanho_bytes. Size of the talk page in bytes (including the wiki markup) if there is one.
  • discusion_tamanho_palabras. Size of the talk page in words if there is one.
  • enlaces_a. Quantity of links to other articles.
  • enlaces_de. Quantity of links to other pages (articles or not).
  • imagenes_cantidad. Quantity of images in the article.
  • referencias. Quantity of references in the article.
  • bibliografía. Quantity of items in the bibliography.
  • wikidata_id. Wikidata identifier (Q ID).
  • wikidata_etiquetas. Quantity of labels of the item.
  • wikidata_descripciones. Quantity of descriptions of the item.
  • wikidata_declaraciones. Quantity of statements of the item.
  • wikidata_declaraciones_referencias_P143. Quantity of references of the item that use P143 (imported from Wikimedia project).
  • wikidata_declaraciones_referencias. Quantity of references of the item.
  • wikidata_identificadores_externos. Quantity of external identifiers of the item.
  • wikidata_interwiki. Quantity of interwiki links of the item.
  • commons_categoria. Commons category linked to the Wikidata item.
  • commons_archivos. Quantity of files in the Commons category (recursion level 3).
  • commons_subcats. Quantity of subcategories in the Commons category (recursion level 3).
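
The column list above can be turned into a small loader. The following is a minimal sketch in Python, not the project's actual tooling. It assumes that the CSV is comma-delimited with a header row using exactly the column names listed above, and that a missing value (for example, an article with no talk page) is stored as an empty cell. The names INTEGER_COLUMNS, parse_row and load_corpus are hypothetical, and the sample row contains made-up values.

```python
import csv
import io

# Columns described above as sizes or quantities.
# Assumption: they hold integers, or are empty when not applicable.
INTEGER_COLUMNS = [
    "tamano_bytes", "tamano_palabras",
    "editores_anonimos", "editores_registrados",
    "discusion_tamanho_bytes", "discusion_tamanho_palabras",
    "enlaces_a", "enlaces_de", "imagenes_cantidad",
    "referencias", "bibliografía",
    "wikidata_etiquetas", "wikidata_descripciones",
    "wikidata_declaraciones",
    "wikidata_declaraciones_referencias_P143",
    "wikidata_declaraciones_referencias",
    "wikidata_identificadores_externos", "wikidata_interwiki",
    "commons_archivos", "commons_subcats",
]


def parse_row(row):
    """Return a copy of row with count columns as int, or None if empty or absent."""
    parsed = dict(row)
    for column in INTEGER_COLUMNS:
        value = (row.get(column) or "").strip()
        parsed[column] = int(value) if value else None
    return parsed


def load_corpus(stream):
    """Read every row of an open CSV stream into a list of dictionaries."""
    return [parse_row(row) for row in csv.DictReader(stream)]


# In-memory sample with made-up values; only a few columns are included.
sample = io.StringIO(
    "id,articulo,tamano_bytes,discusion,discusion_tamanho_bytes\n"
    "1,Ejemplo,1234,,\n"
)
rows = load_corpus(sample)
```

To read the real file, the same function could be called as load_corpus(open("corpus_eswiki_wikidata_commons_articulos.csv", encoding="utf-8", newline="")), assuming the file is UTF-8 encoded. Text columns such as articulo, wikidata_id and commons_categoria are left as strings.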


The following data models were planned for extraction at the beginning of the project, but due to technical and time constraints these specific corpora were not extracted.

  • corpus_articulos_enlace_de.csv. WIP.
  • corpus_articulos_enlaces_a.csv. WIP.
  • corpus_articulos_referencias.csv. WIP.
  • corpus_articulos_bibliografia.csv. WIP.
  • corpus_articulos_wikipedia_versiones.csv. WIP.
  • corpus_articulos_commons_subcats.csv. WIP.