MongoDB

Derived from the word “humongous” (something even bigger than gigantic), MongoDB is one of the most widely used non-relational (NoSQL) databases in the world. Much of this success is due to the fact that, in addition to the open-source version, the company behind Mongo also offers MongoDB Atlas, a hosted version of the database.

Dealing with NoSQL databases brings certain complications, especially because Mongo does not enforce a schema: each document (a record) in a collection (roughly equivalent to a table in a relational database) can have a different structure, and a field (similar to a column) can hold different types of values from one document to the next. In one document, the “createdAt” field can hold the value “2020-01-01”; in another, “01/01/2020” or even “January 1, two thousand and twenty”.
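To make this concrete, here is a minimal sketch (a hypothetical collection, shown as Python dict literals) of three documents that Mongo would happily store side by side:

```python
# Three documents from the same hypothetical collection; MongoDB accepts
# all of them, and nothing enforces a single format for "createdAt".
doc_a = {"_id": 1, "createdAt": "2020-01-01"}   # ISO-style date string
doc_b = {"_id": 2, "createdAt": "01/01/2020"}   # day/month/year string
doc_c = {"_id": 3, "createdAt": "January 1, two thousand and twenty"}  # free text
```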

Beyond the care required when developing production applications to keep collections reasonably consistent, analyzing data coming from Mongo is more complex for analytics and data science, since even basic operations such as summing a variable/column need to deal with the fact that some documents may hold non-numeric values.
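As a sketch of what that means in practice, here is how a simple sum over a hypothetical “amount” field has to defend itself against non-numeric values (illustrative Python, not the platform’s own code):

```python
docs = [
    {"amount": 10},
    {"amount": "25.5"},  # a number stored as text
    {"amount": "N/A"},   # not a number at all
]

total = 0.0
for doc in docs:
    value = doc.get("amount")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        total += value
    else:
        try:
            total += float(value)  # salvage numeric-looking strings
        except (TypeError, ValueError):
            pass                   # skip values that cannot be summed

print(total)  # 35.5
```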

When we developed our first Mongo data source, a lot of manual casting and normalization was required to fit the data into a relational structure. Today we are releasing a new version that includes:

  • Option to automatically detect each column/variable’s type and convert it to a defined format, so the data can be inserted into a relational database
  • Normalization of nested documents into the tables they require, mapping the document from a non-relational structure to a relational one
  • Incremental integration, making it possible to fetch only the new or updated documents of each collection (see the sketch after the image below)
[Image: MONGO_5.png]
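For the incremental option, here is a rough sketch of the idea, assuming a hypothetical “updatedAt” field and a last-sync timestamp persisted between runs (connection details and names are placeholders):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder address
collection = client["mydb"]["orders"]              # placeholder names

# Timestamp stored from the previous successful run.
last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Fetch only documents created or updated since then.
for doc in collection.find({"updatedAt": {"$gt": last_sync}}):
    print(doc)
```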

This data source supports the following MongoDB versions: 2.6, 3.0, 3.2, 3.4, 3.6, 4.0, 4.2, 4.4, and 5.0.

_ids that are themselves JSON objects are not supported: only collections/views whose _id is of type ObjectId, String, or another simple type are supported.
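If you are unsure which types your _ids use, one way to check (a sketch using pymongo and the $type aggregation operator; addresses and names are placeholders) is:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder address
pipeline = [{"$group": {"_id": {"$type": "$_id"}, "count": {"$sum": 1}}}]

for row in client["mydb"]["orders"].aggregate(pipeline):
    print(row)  # e.g. {'_id': 'objectId', 'count': 1200}
# An '_id' type of 'object' here means the collection is not supported.
```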

Adding the data source

To add the MongoDB data source on the Kondado platform, you first need to allow our IPs to access the database.

Once you have allowed our IPs through your firewall, you can provide your connection details in one of two ways: individual parameters or a connection string. To choose between them, use the “Connection method” parameter.

If you selected the Connection string method, enter the connection string in the “Connection string” field. Some examples of connection strings are shown below.

Connection string examples:

mongodb://mongodb0.example.com:27017
mongodb+srv://server.example.com/
[Image: MONGO_6.png]
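For reference, this is how the two formats behave from a driver’s point of view (a pymongo sketch using the example hostnames above; the “+srv” form additionally requires the dnspython package):

```python
from pymongo import MongoClient

# Standard connection string: a direct host and port.
client_a = MongoClient("mongodb://mongodb0.example.com:27017")

# DNS seedlist ("+srv") string: the driver looks up the actual cluster
# members in DNS SRV records; this is the usual form on MongoDB Atlas.
client_b = MongoClient("mongodb+srv://server.example.com/")
```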

If you choose the Individual Parameters connection method, see the description of the required fields below:

[Image: MONGO_7.png]

Name: An internal description of your data source, for example “Mongo product”

Address: IP address or DNS name of your database or cluster. Enter only the address here, without including the database, port, or parameters

DNS Seedlist: Check this option only if your database uses a DNS seedlist for connection (i.e., your connection string contains “+srv”). This option is widely used with MongoDB Atlas databases.

Port: the port that should be used for the connection (usually 27017).

Authentication database: just the name of the authentication database (for example: admin)

Query database: just the query database name

User: the database user

Password: the user's password
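To see how these parameters fit together, here is a sketch of the equivalent connection string they describe (all values are placeholders):

```python
from urllib.parse import quote_plus

user = "analytics_user"            # User
password = quote_plus("s3cret!")   # Password (escaped for use in a URI)
address = "mongodb0.example.com"   # Address (no port, database or parameters)
port = 27017                       # Port
auth_db = "admin"                  # Authentication database

uri = f"mongodb://{user}:{password}@{address}:{port}/?authSource={auth_db}"
# With "DNS Seedlist" checked, the scheme becomes "mongodb+srv://" and the
# port is dropped, since hosts are resolved from DNS SRV records instead.
print(uri)
```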

Now you can create your pipelines (one per collection) and start analyzing your Mongo data relationally!

Pipelines

Relationship chart

Collections & Views

Our integration can extract data not only from collections, but also from views.

The schema of the table(s) that will be created in your database will depend on the following parameters:

Normalize: Indicates whether, if you have a collection/view with nested documents, you want a new table to be created for that lower level. If you choose not to normalize, the nested data will be received as text. Normalization is applied only one level down, and records at the second level inherit the _id of their respective first-level records.
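A minimal sketch of that one-level normalization (illustrative Python, not the platform’s implementation; field names are hypothetical):

```python
doc = {
    "_id": "abc123",
    "col_x": "value",
    "items": [                       # nested documents, one level down
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

# First level: everything that is not nested.
main_row = {k: v for k, v in doc.items() if not isinstance(v, (list, dict))}

# Second level: one row per nested document, inheriting the parent _id.
nested_rows = [{"_id": doc["_id"], **item} for item in doc["items"]]

print(main_row)     # {'_id': 'abc123', 'col_x': 'value'}
print(nested_rows)  # [{'_id': 'abc123', 'sku': 'A1', 'qty': 2}, ...]
```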

Cast: If you choose to apply a cast, we will sample the documents in your collection/view to determine the type each field should have in your destination. This sampling is strict: if the records of a given field are not all of the same type, the column will be created as text. If you choose not to cast, your integration cannot be incremental and all fields will be created and inserted as text.
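The strictness of that sampling rule can be illustrated with a small sketch (our own simplification, not the actual sampling code):

```python
def infer_type(sampled_values):
    """A field keeps a concrete type only if every sample agrees on it."""
    types = {type(v) for v in sampled_values if v is not None}
    if types == {bool}:
        return "boolean"
    if types == {int}:
        return "integer"
    if types and types <= {int, float}:
        return "numeric"
    return "text"  # mixed or unrecognized types fall back to text

print(infer_type([1, 2, 3]))   # integer
print(infer_type([1, 2.5]))    # numeric
print(infer_type([1, "2"]))    # text: one non-numeric sample is enough
```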

(Optional) Sample schema _ids (comma-separated): If some records have a more complete schema, you can provide their _ids here so that their fields are considered when assembling the final schema

(Optional) Fields where casting should not be applied (comma-separated): If you select Cast=Yes, you can provide field names (comma-separated) that must be written as text rather than cast to whatever type their pattern suggests. This is useful for fields that store document/phone/ZIP numbers, which look like numbers but whose value a numeric conversion can change, for example by dropping the leading 0 of a ZIP code
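The ZIP-code pitfall in two lines of plain Python:

```python
print(int("01234"))  # 1234  -- cast to a number, the leading zero is gone
print("01234")       # 01234 -- kept as text, the value is preserved
```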

The tables created will have a format similar to the one below (where the “nested” table is the result of normalization):

Main object

Field    Type
_id      text
col_x    text
col_y    text
col_z    text

nested

Field    Type
_id      text
…        text
…        text
…        text