User:OrenBochman/Efficent SQL

From Meta, a Wikimedia project coordination wiki
Speed up your queries!

Coding with efficent SQL[edit]

The tutorial will cover:

  • Why Optimization Matters?
  • Why Indices are most important?
  • How to avoid Unindexed and Unlimited Queries?
  • Using EXPLAIN.

The slide deck for this tutorial has specific examples of optimized queries and simple practices that you can use to speed up your queries.

Simple Prep You Need in order to Take this Tutorial

For the practice exercise, you will access a database with sample data in labs. All you need to access it is a labs account and membership in the 'bastion' project (all users who are members of any project are also members of the 'bastion' project). You don't need to be a member of the tutorial project. Before the tutorial, we suggest that you be sure you can access this database. You need to ssh into bastion (ssh bastion.wmflabs.org) and, once you're in, run

mysql -h tutorial-mysql -u tutorial commonswiki_partial 

You may also want to suggest a query to be used in the demo, via the list below.


Introduction[edit]

To many MediaWiki developers, SQL query performance and optimization is shrouded in mystery. Most know that there are efficient and inefficient queries, and that if they write an inefficient query, it will either be noticed during code review, or it will be noticed because it takes down a wiki, which will prompt an ops person to fix the breakage and yell at the developer who caused it. But few people seem to really understand how query performance works.

In this unit we will answer the following questions:

  1. How can you tell if a query is inefficient?
  2. How do you write efficient queries, and avoid inefficient ones?
  3. What are the best pracatices for SQL that developers should be aware of?

This tutorial will cover the basics of how database engines in general, and MySQL specifically, execute different kinds of queries, and explain why certain queries are executed more efficiently than others and what role indexes play in this process. We will demonstrate better practices by writing efficient queries, and showing you how to use tables and indexes so they facilitate efficient queries, and discuss common pitfalls that result in inefficient queries and how to address them. We will also demonstrate how to obtain a query analysis from MySQL and how to make sense of it.


Concept 1 Indexes[edit]

Indexes are Pre-sorted lists of rows and are the key to speeding up queries.

page_id PRIMARY KEY
page_id page_len
1 10775
67 1305
69 548
70 34
74 50085
77 10292
... ...
INDEX foo (page_len)
page_id page_len
1320629 15
9609464 15
340226 16
725948 16
940255 16
1065064 16
... ...
SELECT * FROM page
ORDER BY page_len LIMIT 5;

by using an index in our table we can the cost of avoid sorting the table which amounts to saving N*Log(N), where N is the number of items in the the table. For english wikipedia we have about N=4,000,000 pages which would cost about 26 million operations.


Concept 2 Limit[edit]

Limiting a query allows the database to return the result earlier without calculating the results for the whole table. This can be done in two ways.

    • Limit the number or rows returned using (LIMIT)
    • Limit the number or rows scanned by making a smarter query, e.g. by using an indexed field.
SELECT * FROM page
ORDER BY page_touched LIMIT 5;


page_id PRIMARY KEY
page_id page_touched
1 20120522185124
67 20120515122531
69 20120115072601
70 20120123124730
74 20120520174854
... ...
INDEX foo (page_len)
page_id page_touched
10212 20050626044452
9609464 20050626044503
340226 20050626044503
725948 20050626044522
31045 20050626044523

When the table is index it is very efficent. Another advantage of limited queries can be seen in the query bellow which does not require the table to be sorted.

SELECT * FROM page
WHERE page_is_new=0 LIMIT 5;

If the table has an index then the following two select queries works as a binary search.

SELECT * FROM page
WHERE page_id = 94109 LIMIT 1;


SELECT * FROM page
WHERE page_id >= 94109
ORDER BY page_id LIMIT 3;

Concept 3 Indexing using two fields[edit]

page_id PRIMARY KEY
lastname firstname phone
Stevens Daniel 526-3401
Stevenson Amy 945-8547
Stevenson John 324-0625
Stevenson John 324-0625
INDEX foo (page_len)
page_namespace page_title page_id
0 Main_Page 3401
0 Nanobots 526
1 Main_Page 8547
2 Catrope 9

Recoomendations[edit]

  • Think about how the Database will run your query.
    • Ask how can I limit scaned rows ?
    • Ask how can I limit returned rows ?
    • What can I precalculate via an Index, Summary table, Counter table, etc...?
  • Add indexes where needed.
  • Use batch queries (when it makes sense).
  • In some cases, denormalize for performance.
    • Add information duplicated from other tables.
    • Summary tables, counter tables, cache tables, etc.


  • Avoid running unindexed queries.
    • caveat: WHERE on rarely false conditions usually OK
    • Unindexed ORDER BY (filesort) never OK
  • ORDER BY expression --> filesort == bad
  • QAvoid Page with OFFSET (or LIMIT 50,50)
    • LIMIT 10 OFFSET 5000 scans 5010 rows
    • Use WHERE foo_id >= 12345 instead
  • Avoid using COUNT(), SUM(), GROUP BY, etc...
    • These do not limit rows scanned.
    • MAX()/MIN() of indexed field on entire table is OK.

Suggest a Query to be Used in the Demo[edit]

Below, add a query you want to see optimized in the tutorial Demo (list query suggestions here):

  1. ContributionScores query
SELECT user_id,
   user_name,
   user_real_name,
   page_count,
   rev_count,
   page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
 (
  SELECT rev_user,
   COUNT(DISTINCT rev_page) AS page_count,
   COUNT(rev_id) AS rev_count
  FROM `bw_revision`
  WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
  GROUP BY rev_user
  ORDER BY page_count DESC LIMIT 50
)UNION(
  SELECT rev_user,
   COUNT(DISTINCT rev_page) AS page_count,
   COUNT(rev_id) AS rev_count
  FROM `bw_revision`
  WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
  GROUP BY rev_user
  ORDER BY rev_count DESC LIMIT 50
 )
) s ON (user_id=rev_user) ORDER BY wiki_rank DESC LIMIT 50;

MW:Extension:ContributionScores polls the wiki database to locate contributors with the highest contribution volume - this has NOT been tested on a high-volume wiki. The extension is intended for fledgling Wikis looking to add a fun metric for Contributors to see how much they are helping out.

It is used at translatewiki.net and occasionally causes (very) slow queries there. Example query:

# Time: 120525  6:51:18
# User@Host: twn[twn] @ localhost []
# Query_time: 28.124669  Lock_time: 0.000105 Rows_sent: 50  Rows_examined: 18860849
SELECT user_id,
   user_name,
   user_real_name,
   page_count,
   rev_count,
   page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
  (
    SELECT rev_user,
     COUNT(DISTINCT rev_page) AS page_count,
     COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY page_count DESC
    LIMIT 50
  ) UNION (
    SELECT rev_user,
      COUNT(DISTINCT rev_page) AS page_count,
      COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY rev_count DESC
    LIMIT 50
  )
) s ON (user_id=rev_user)
ORDER BY wiki_rank DESC LIMIT 50;


Uses

2. Batch query vs. many queries Don't have an example off the top of my head, but this may be a less obvious optimisation with potentially big rewards.

Discussion[edit]

If there are any questions you have about this lesson, ask them! My job, as your adopter, is to help you with any problem you may have. If you don't have any questions that you need to ask, your next step is to take a short test regarding this lesson. If you are ready to take the test, simply tell me (either on this page or on my talk page) and I will hand it out to you.