Rails: How I updated a table with 26 million rows
If you've ever wondered about updating a Rails model with a large number of records (on the order of millions), this post is for you.
I was tasked with updating a database table that has 26 million records. Doing it in raw SQL would have been faster, but I needed to leverage Active Record while updating. Each record needs this update:
user.update_columns(new_column1: user.old_column1_attributes,
                    new_column2: user.old_column2_attributes,
                    meta_jsonb: user.meta.transform_values!(&:to_i))
The rest of this post is how I achieved it.
My first attempt was to do it in a migration, and I realized very quickly how bad an idea that was. A migration runs in a single transaction and waits until the end to commit all the data at once; I ran out of memory because the transaction was too big.
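For context, a minimal sketch of what that first attempt might look like (the migration class name and Rails version here are hypothetical; the exact loop doesn't matter, since every update accumulates inside the migration's single transaction):
class BackfillUserColumns < ActiveRecord::Migration[5.2]
  def up
    # All of these updates pile up in one transaction that
    # only commits when the migration finishes.
    User.find_each do |user|
      user.update_columns(new_column1: user.old_column1_attributes,
                          new_column2: user.old_column2_attributes,
                          meta_jsonb: user.meta.transform_values!(&:to_i))
    end
  end
end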
My second attempt was to take the code out of the migration, make it a rake task, and use find_each, which loads records in batches (1,000 by default) to avoid memory bloat.
User.find_each(batch_size: 100_000) do |user|
  user.update_columns(new_column1: user.old_column1_attributes,
                      new_column2: user.old_column2_attributes,
                      meta_jsonb: user.meta.transform_values!(&:to_i))
end
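Outside a migration, each update_columns call commits as its own statement. A minimal sketch of the rake task wrapper, assuming a hypothetical data:backfill_users task name:
namespace :data do
  desc "Backfill new user columns from the old ones"
  task backfill_users: :environment do
    # the find_each loop shown above goes here
  end
end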
With this approach memory usage was under control, but the update still took a long time, and someone had to babysit it the whole way through (more than 6 hours).
For my third attempt, I split the task into multiple batches with start and finish ids, printing each processed record id to keep track of progress. This is much faster because each run selects only a subset of the records, and nobody has to babysit the task and pick the next batch based on the last completed record id.
User.find_each(batch_size: 100_000, start: start_id, finish: finish_id) do |user|
  p "Processing user id #{user.id}"
  user.update_columns(new_column1: user.old_column1_attributes,
                      new_column2: user.old_column2_attributes,
                      meta_jsonb: user.meta.transform_values!(&:to_i))
end
Takeaways:
- Move to a rake task for data migrations that take a long time.
- Use find_each with start and finish to make the selection of records faster.
- Send the start and finish as parameters to the task so it can be run in small batches (see the sketch after this list).
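Putting the takeaways together, here is a sketch of the final task; the namespace, task name, and argument handling are my own assumptions, not the exact code I ran:
namespace :data do
  desc "Backfill new user columns between two ids"
  task :backfill_users, [:start_id, :finish_id] => :environment do |_task, args|
    start_id  = args[:start_id].to_i
    finish_id = args[:finish_id].to_i

    # Each run covers only the [start_id, finish_id] slice of the table.
    User.find_each(batch_size: 100_000, start: start_id, finish: finish_id) do |user|
      p "Processing user id #{user.id}"
      user.update_columns(new_column1: user.old_column1_attributes,
                          new_column2: user.old_column2_attributes,
                          meta_jsonb: user.meta.transform_values!(&:to_i))
    end
  end
end
Each run then covers one id range, e.g. rake data:backfill_users[1,5000000], and the next run picks up from the following range.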
Note: I don’t wrap a transaction around the updates because there is only one statement per user, and I don’t want the whole batch to roll back when one record fails. If that happens, I can look at the logs to find the failed record and go from there.
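If you would rather capture failed ids explicitly than dig through logs, one variation (my own addition, not what I ran) is to rescue per record inside the loop:
User.find_each(batch_size: 100_000, start: start_id, finish: finish_id) do |user|
  user.update_columns(new_column1: user.old_column1_attributes,
                      new_column2: user.old_column2_attributes,
                      meta_jsonb: user.meta.transform_values!(&:to_i))
rescue StandardError => e
  # Record the failure and keep going; rerun later for just these ids.
  p "Failed user id #{user.id}: #{e.message}"
end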