Rails: How I updated table with 26 million rows

Raghu Varma Bhupatiraju
2 min read · Oct 11, 2022


If you have ever wondered how to update a Rails model with a large number of records (on the order of millions), this post is for you.

I was tasked with updating a database table that has 26 million records. Doing it in raw SQL would be faster, but I needed to go through Active Record while updating. The update itself looked like this for each user:

user.update_columns(new_column1: user.old_column1_attributes,
                    new_column2: user.old_column2_attributes,
                    meta_jsonb: user.meta.transform_values!(&:to_i))

The rest of this post is how I got there.

My first attempt was to do it in a migration, which I realized very quickly was a bad idea. A migration runs in a single transaction and waits until the end to commit all the data at once. I ran out of memory because the transaction was too big.
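For context, that first attempt looked roughly like the sketch below. The migration class name and the plain User.all.each loop are my reconstruction, not the original code:

class BackfillUserColumns < ActiveRecord::Migration[7.0]
  def up
    # Everything here runs inside the migration's single transaction,
    # so the 26 million updates are only committed at the very end.
    User.all.each do |user|
      user.update_columns(new_column1: user.old_column1_attributes,
                          new_column2: user.old_column2_attributes,
                          meta_jsonb: user.meta.transform_values!(&:to_i))
    end
  end
end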

The second attempt was to take the code out of the migration, make it a rake task, and use find_each, which loads records in batches (1,000 by default) to avoid memory bloat.

User.find_each(batch_size: 100_000) do |user|
  user.update_columns(new_column1: user.old_column1_attributes,
                      new_column2: user.old_column2_attributes,
                      meta_jsonb: user.meta.transform_values!(&:to_i))
end

This approach was memory-efficient, but it still took a long time to update the records, and someone had to babysit it the whole time (it took more than 6 hours).

For the third attempt, I split the task into multiple batches with start and finish IDs, and printed the processed record IDs to keep track of what had been updated. This is much faster because each run selects only a subset of records, and nobody needs to babysit it; you just kick off the next batch based on the last completed record ID.

User.find_each(batch_size: 100_000, start: start_id, finish: finish_id) do |user|
  p "Processing user id #{user.id}"
  user.update_columns(new_column1: user.old_column1_attributes,
                      new_column2: user.old_column2_attributes,
                      meta_jsonb: user.meta.transform_values!(&:to_i))
end

Takeaways:

  • Move long-running data migrations into a rake task.
  • Use find_each with start and finish to make the selection of records faster.
  • Pass the start and finish IDs as parameters to the task so it can be run in small batches (a sketch of such a task follows this list).
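A minimal sketch of how such a rake task could be wired up. The task name, namespace, and argument handling via environment variables are my assumptions, not the exact code from this project:

namespace :users do
  desc "Backfill new columns for users between START_ID and FINISH_ID"
  task backfill_columns: :environment do
    start_id  = Integer(ENV.fetch("START_ID"))
    finish_id = Integer(ENV.fetch("FINISH_ID"))

    User.find_each(batch_size: 100_000, start: start_id, finish: finish_id) do |user|
      p "Processing user id #{user.id}"
      user.update_columns(new_column1: user.old_column1_attributes,
                          new_column2: user.old_column2_attributes,
                          meta_jsonb: user.meta.transform_values!(&:to_i))
    end
  end
end

You would run it as, for example, bundle exec rake users:backfill_columns START_ID=1 FINISH_ID=1000000, then start the next range once it finishes.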

Note: I don’t wrap the updates in a transaction because each update is a single transaction per user, and I don’t want the whole batch to roll back when one of them fails. If a record does fail, I look at the logs to find it and go from there.
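If you would rather have the task skip a bad row and keep going instead of stopping at the first failure, one option (my own addition, not part of the original task) is to rescue per record and log the ID for a later retry:

User.find_each(batch_size: 100_000, start: start_id, finish: finish_id) do |user|
  begin
    user.update_columns(new_column1: user.old_column1_attributes,
                        new_column2: user.old_column2_attributes,
                        meta_jsonb: user.meta.transform_values!(&:to_i))
  rescue => e
    # Log the failed record id so it can be fixed and retried later.
    Rails.logger.error "Failed to update user #{user.id}: #{e.message}"
  end
end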
