User:MPopov (WMF)/Notes/Refinery

From Meta, a Wikimedia project coordination wiki

This page provides a brief introduction to working with Analytics Engineering's Refinery source code for the purpose of a Product Analytics skillshare on UDF development.

Setup

  • Required
    • Analytics Refinery source repository: git clone ssh://gerrit.wikimedia.org:29418/analytics/refinery/source
    • Java Development Kit (JDK) 8
      • macOS & Linux binaries are available from Oracle
      • Linux users have the option of OpenJDK: sudo apt-get install openjdk-8-jdk (the JDK package rather than openjdk-8-jre, since building the code needs javac)
      • Once installed, I recommend setting the JAVA_HOME environment variable in your ~/.bash_profile or ~/.bashrc (the path below is from a macOS install of JDK 8u172; adjust it to match yours):
        export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
        export PATH="$JAVA_HOME/bin:$PATH"
        
    • Apache Maven (Mac with Homebrew: brew install maven; Linux: sudo apt-get install maven)
  • Recommended IDE for Java: IntelliJ IDEA (Community Edition)
    • IDEA currently fails to recognize some of the sources in the repo, so installing the Apache Avro plugin wouldn't hurt (but also doesn't seem to help)
    • Python users might recognize its maker, JetBrains, as the company behind PyCharm

Basics

First, run mvn package from the directory where you cloned the repo. This should download all the necessary dependencies into ~/.m2/, run the existing unit tests, and build the refinery source code into JARs.

The generated refinery-hive/target/refinery-hive-X.Y.ZZ-SNAPSHOT.jar is the kind of artifact you import in a Hive query. The deployed builds are loaded via statements like:

  • ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
  • ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;

Refer to Analytics/Systems/Cluster/Hive/QueryUsingUDF for more details.

Development

Importing project into IntelliJ IDEA

Choose Import Project and select the cloned repo directory. Pick Maven under "Import project from external model" and proceed with all the default choices until the project has been imported.

If at any point you land on an SDK selection screen, pick the JDK 8 that you installed earlier.

  • If JDK 8 isn't listed, use the + button to add it:
    • On a Mac, add: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
    • On a Linux PC, the directory is something like /usr/lib/jvm/java-8-openjdk/ but I suggest running which javac to confirm (if that prints a symlink, readlink -f "$(which javac)" shows the real location)
  • Refer to Working with SDKs and Configuring IntelliJ Platform Plugin SDK for help

IDEA will then index the files and report that the org.wikimedia.analytics.schema symbol cannot be resolved. Ignore it: Nuria and I have no idea how to fix this, as our best bet of installing the Apache Avro plugin didn't work. You can still write code with all kinds of helpful hints and code completion suggestions in IDEA, and then test and build from the command line with mvn package.

Unit Tests

They're good and you should write them.
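
To make that a bit more concrete, here is a minimal sketch of the habit I have in mind: keep the UDF's logic in a plain method and check it against a few inputs, including null, since a UDF will see NULL column values. Everything in it is hypothetical and not taken from the refinery repo: the NormalizeHostSketch class, its evaluate method, and the normalization rule are made up for illustration. It uses only the Java standard library so it runs on its own; a real UDF would also need the Hive classes on the classpath, and real tests would live in the repo's test sources and run as part of mvn package.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical example, not part of refinery: the logic a small UDF might wrap.
final class NormalizeHostSketch {

    // Lowercases a hostname, trims whitespace, and drops one trailing dot.
    // Returns null for null or blank input, mirroring NULL-in, NULL-out.
    static String evaluate(String host) {
        if (host == null) {
            return null;
        }
        String cleaned = host.trim().toLowerCase(Locale.ROOT);
        if (cleaned.endsWith(".")) {
            cleaned = cleaned.substring(0, cleaned.length() - 1);
        }
        return cleaned.isEmpty() ? null : cleaned;
    }

    // Tiny stand-in for a test framework's assertEquals.
    static void check(String expected, String actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        check("en.wikipedia.org", evaluate("  EN.Wikipedia.org. "));
        check("meta.wikimedia.org", evaluate("meta.wikimedia.org"));
        check(null, evaluate(null));   // NULL column value
        check(null, evaluate("   "));  // blank input
        check(null, evaluate("."));    // nothing left after stripping the dot
        System.out.println("all checks passed");
    }
}
```

The point of the sketch is the shape, not the rule: one check per behavior you care about, with the awkward inputs (null, blank, edge cases) written down next to the happy path.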