Recently, in one of the projects we took over, which uses MongoDB and is hosted on Heroku, we were asked to speed up a file import process. It seemed like a simple task, but without a few tricks we wouldn’t have managed it.

A few things first: we’re using a standard Heroku app configuration and the MongoHQ add-on, which has been enough for our needs so far. So why does importing a 10MB file take so long? It took around 1h. You might say that the problem is in the code, but that’s only half the truth.

Below you can see an example of a service that imports files with BIM (Building Information Model) objects. Every BIM object has properties, and the properties may be duplicated. The important thing is that we have 2 loops here. A 10MB file may include 10k objects and 5k properties, so the service has to save 15k records in the DB.

class ImportBimObjectsService
  def call(model, data)
    bim_object_parser = BimObjectsParser.new(data)

    bim_object_parser.bim_properties.each do |name|
      bim_property = model.bim_properties.find_or_initialize_by(name: name)
      unless bim_property.model_ids.include?(model.id)
        bim_property.model_ids << model.id
      end
      bim_property.save
    end

    bim_object_parser.bim_objects.each do |bim_object_attrs|
      model.bim_objects.create(bim_object_attrs)
    end
  end
end

We can guess where the problem is. 15k requests to the DB isn’t a small number, especially when we’re using MongoHQ and Heroku. Usually the DB server is in a different location than the web server, so the latency is much higher than in a local environment. In our case the difference between Heroku and the local environment was quite big: in development we were able to import the file in 7 min, while on Heroku it took 1h.
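A rough back-of-the-envelope estimate makes the gap concrete. The sketch below assumes, for simplicity, that the whole import time is spent waiting on those 15k requests (in reality parsing and validation take a share too), so the figures are only an upper bound on the cost of a single request. The method name `ms_per_request` is made up for this illustration.

```ruby
# Rough per-request cost, assuming all import time is spent on DB round trips.
REQUESTS = 15_000

# Hypothetical helper: average milliseconds spent per request.
def ms_per_request(total_seconds, requests = REQUESTS)
  (total_seconds * 1000.0 / requests).round
end

heroku_ms = ms_per_request(60 * 60) # 1h on Heroku
local_ms  = ms_per_request(7 * 60)  # 7 min in development

puts "Heroku: ~#{heroku_ms} ms per request"
puts "Local:  ~#{local_ms} ms per request"
```

Under that assumption a single request costs roughly 240 ms on Heroku versus 28 ms locally, which is why cutting the number of requests matters far more than tuning the Ruby code.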

How can we minimize the number of requests?

We can use MongoDB’s insert method. However, insert doesn’t run validations, so it’s up to us to make sure that each record is valid. Using insert amounts to storing raw data in the DB. One last thing to remember: before we store the data, we have to add fields like updated_at and created_at to the attributes ourselves.

class ImportBimObjectsService
  def call(model, data)
    bim_object_parser = BimObjectsParser.new(data)

    bim_object_parser.bim_properties.each do |name|
      bim_property = model.bim_properties.find_or_initialize_by(name: name)
      unless bim_property.model_ids.include?(model.id)
        bim_property.model_ids << model.id
      end
      bim_property.save
    end

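    now = Time.now
    # Build each object in memory, validate it, and keep only the valid raw documents.
    valid_bim_objects = bim_object_parser.bim_objects.map do |bim_object_attrs|
      bim_object = model.bim_objects.build(bim_object_attrs)
      if bim_object.valid?
        # insert skips callbacks, so the timestamps have to be set by hand.
        bim_object.as_document.merge({ created_at: now, updated_at: now })
      end
    end.compact # map yields nil for invalid objects; drop them before inserting

    # One bulk request instead of one request per object.
    model.bim_objects.collection.insert(valid_bim_objects) unless valid_bim_objects.empty?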
  end
end
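To see why this cuts the number of round trips, here is a self-contained sketch with no MongoDB involved. `CountingCollection` is a hypothetical stand-in that only counts calls to `insert`; it is not part of Mongoid or the MongoDB driver, and the two `import_*` helpers are likewise made up for the comparison.

```ruby
# Hypothetical stand-in for a collection: stores documents and counts
# how many times insert is called (one call = one round trip).
class CountingCollection
  attr_reader :requests, :documents

  def initialize
    @requests = 0
    @documents = []
  end

  # Accepts a single document (Hash) or an array of documents.
  def insert(docs)
    @requests += 1
    @documents.concat(docs.is_a?(Array) ? docs : [docs])
  end
end

# One request per record, like calling create in a loop.
def import_one_by_one(collection, records)
  records.each { |record| collection.insert(record) }
end

# A single request carrying all records.
def import_in_bulk(collection, records)
  collection.insert(records)
end

records = Array.new(10_000) { |i| { name: "object-#{i}" } }

slow = CountingCollection.new
import_one_by_one(slow, records)

fast = CountingCollection.new
import_in_bulk(fast, records)

puts "one by one: #{slow.requests} requests"
puts "bulk:       #{fast.requests} request"
```

Both collections end up with the same 10k documents; only the number of requests differs.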

Thanks to the solution presented above, we were able to reduce the number of requests from 15k to 5k, but we can do even better. Note that the part of the code responsible for saving properties isn’t optimal. We could reduce the number of find_or_initialize_by calls. To do that, we can use a simple cache that stores only unique properties.

class ImportBimObjectsService
  class BimPropertyUniqCache
    attr_accessor :objects

    def initialize
      # Hash of property name => array of model ids
      # (an Array cannot be indexed by a name).
      @objects = {}
    end

    def add(name, model_id)
      if objects[name].present?
        unless objects[name].include?(model_id)
          @objects[name] << model_id
        end
      else
        @objects[name] = [model_id]
      end
    end
  end

  def call(model, data)
    bim_object_parser = BimObjectsParser.new(data)

    bim_property_cache = BimPropertyUniqCache.new
    bim_object_parser.bim_properties.each do |name|
      bim_property_cache.add(name, model.id)
    end

    bim_property_cache.objects.each do |(name, model_ids)|
      bim_property = model.bim_properties.find_or_initialize_by(name: name)
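      # Merge the cached ids with the stored ones and drop duplicates
      # (a bare `uniq` returns a new array and leaves the attribute unchanged).
      bim_property.model_ids = (bim_property.model_ids + model_ids).uniq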
      bim_property.save
    end

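    now = Time.now
    # Build each object in memory, validate it, and keep only the valid raw documents.
    valid_bim_objects = bim_object_parser.bim_objects.map do |bim_object_attrs|
      bim_object = model.bim_objects.build(bim_object_attrs)
      if bim_object.valid?
        # insert skips callbacks, so the timestamps have to be set by hand.
        bim_object.as_document.merge({ created_at: now, updated_at: now })
      end
    end.compact # map yields nil for invalid objects; drop them before inserting

    # One bulk request instead of one request per object.
    model.bim_objects.collection.insert(valid_bim_objects) unless valid_bim_objects.empty?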
  end
end

Thanks to the BimPropertyUniqCache class we were able to avoid unnecessary requests.
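The effect of the cache can be checked in isolation. The sketch below is plain Ruby with made-up property names and a made-up model id; `group_model_ids` is a hypothetical helper that mirrors what BimPropertyUniqCache does, using a `Set` per name to keep the ids unique.

```ruby
require 'set'

# Hypothetical helper mirroring BimPropertyUniqCache: collapse a list of
# [name, model_id] pairs into a hash of name => unique model ids.
def group_model_ids(pairs)
  grouped = Hash.new { |hash, name| hash[name] = Set.new }
  pairs.each { |name, model_id| grouped[name] << model_id }
  grouped.transform_values(&:to_a)
end

# Made-up sample: the parser may return the same property name many times.
names = %w[height width height material width height]
pairs = names.map { |name| [name, 42] }

grouped = group_model_ids(pairs)

puts "names from parser: #{names.size}"
puts "lookups needed:    #{grouped.size}"
```

For this sample, six names collapse into three unique ones, so only three find_or_initialize_by calls would be needed instead of six.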

Conclusion:

Remember that access to MongoDB is faster locally than on Heroku. You can easily work around this by using a mass insert. Unfortunately, insert skips validations, so we need to validate the records beforehand. We’re forced to write more code, but the processing time decreases significantly. In the end, importing a 10MB file takes around 1 min.
