Importing Table Data Into a Graph Database With GQLAlchemy

Importing Table Data Into a Graph Database With GQLAlchemy

·

5 min read

Are you considering moving from table data to a graph database, but it seems like a complicated migration? With this short tutorial, we are going to show you how to do just that using GQLAlchemy. You will learn how to import table data from files stored in local or online storage to a Memgraph graph database. You can find the original Jupyter Notebook in our open-source GitHub repository.

If at any point you experience problems with this tutorial or something is unclear to you, reach out on our Discord server.

1. Prerequisites

For this tutorial, you will need to install:

  • Jupyter: Jupyter is needed if you want to run this tutorial as a notebook.

  • Docker: Docker is used because Memgraph is a native Linux application and cannot be installed on Windows and macOS.

  • GQLAlchemy: A Python OGM (Object Graph Mapper) that connects to Memgraph

2. Installation using Docker

After you install Docker, you can set up Memgraph by running:

docker run -it -p 7687:7687 -p 7444:7444 -p 3000:3000 memgraph/memgraph-platform

This command will start the download and after it finishes, run the Memgraph container.

3. Connecting to Memgraph with GQLAlchemy

We will be using the GQLAlchemy object graph mapper (OGM) to connect to Memgraph and execute Cypher queries easily. GQLAlchemy also serves as a Python driver/client for Memgraph. You can install it using:

pip install gqlalchemy

Hint: You may need to install CMake before installing GQLAlchemy.

Maybe you got confused when I mentioned Cypher. You can think of Cypher as SQL for graph databases. It contains many of the same language constructs like CREATE, UPDATE, DELETE... and it's used to query the database.

from gqlalchemy import Memgraph

memgraph = Memgraph("127.0.0.1", 7687)

Let's make sure that Memgraph is empty before we start with anything else.

memgraph.drop_database()

results = memgraph.execute_and_fetch(
    """
    MATCH (n) RETURN count(n) AS number_of_nodes ;
    """
)
print(next(results))

4. Making mock table data file

For this tutorial, we are going to create CSV files to represent our data. Using two files, you can also model a relationship between them with a foreign key. But more on that later, for now, you can run the following code to make two CSV files, individual.csv and address.csv.

import csv

header = ["ind_id", "name", "surname", "add_id"]
data = [
    [1, "Tomislav", "Petrov", 1],
    [2, "Ivan", "Horvat", 3],
    [3, "Marko", "Horvat", 3],
    [4, "John", "Doe", 2],
    [5, "John", "Though", 4],
]

with open("individual.csv", "w", encoding="UTF8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(data)
header = ["add_id", "street", "street_num", "city"]
data = [
    [1, "Ilica", 2, "Zagreb"],
    [2, "Death Valley", 0, "Knowhere"],
    [3, "Horvacanska", 3, "Horvati"],
    [4, "Broadway", 12, "New York"],
]

with open("address.csv", "w", encoding="UTF8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(data)

5. Defining a translation

In order to translate the table data into graph form, you need to give the importer instructions on how to do that. For this, you should define a data configuration object which can be a dictionary in Python code or a YAML file.

Here is a general structure of a data configuration YAML file:

indices:    # indices to be created for each table
  individuals:    # name of the table containing individuals with ind_id
  - ind_id
  address:
  - add_id


name_mappings:    # how we want to name node labels
  individuals:
    label: INDIVIDUAL    # nodes made from the individuals table will have the label INDIVIDUAL
  address:
    label: ADDRESS
    column_names_mapping: {"current_column_name": "mapped_name"}    # (optional) map column names


one_to_many_relations:
  address: []    # currently needed, leave [] if no relations to define
  individuals:
    - foreign_key:                # foreign key used for mapping;
      column_name: add_id         # specifies its column
      reference_table: address    # name of table from which the foreign key is taken
      reference_key: add_id       # column name in reference table from which the foreign key is taken
    label: LIVES_IN               # label applied to relationship created
      from_entity: False          # (optional) define direction of relationship created


many_to_many_relations:      # intended to be used in case of associative tables
  example:
    foreign_key_from:        # describes the source of the relationship
      column_name:
      reference_table:
      reference_key:
    foreign_key_to:          # describes the destination of the relationship
      column_name:
      reference_table:
      reference_key:
    label:

For our example, we will use a dictionary to simplify the input process. We want to create nodes for all rows in both tables and a relationship LIVES_IN between an INDIVIDUAL and ADDRESS nodes. Also, you will not be using the many_to_many_relations field, so your data configuration object is:

data_conf = {
    "indices": {"address": ["add_id"], "individual": ["ind_id"]},
    "name_mappings": {
        "individual": {"label": "INDIVIDUAL"},
        "address": {"label": "ADDRESS"},
    },
    "one_to_many_relations": {
        "address": [],
        "individual": [
            {
                "foreign_key": {
                    "column_name": "add_id",
                    "reference_table": "address",
                    "reference_key": "add_id",
                },
                "label": "LIVES_IN",
            }
        ],
    },
}

6. Importing from the local file system

Finally, to import the data into Memgraph, all you need to do is make an instance of LocalFileSystemImporter and call its translate() method.

from gqlalchemy.loaders import CSVLocalFileSystemImporter

importer = CSVLocalFileSystemImporter(
    data_configuration=data_conf,
    path="./",
)

importer.translate(drop_database_on_start=True)

7. Importing from online file storage

Furthermore, we offer support for reading data from online storage providers Amazon S3 and Azure Blob via AmazonS3Importer and AzureBlobImporter. To use the Azure storage, for example, simply use the following snippet:

importer = CSVAzureBlobFileSystemImporter(
    container_name="<your_container_name>",
    data_configuration=data_conf,
    account_name="<your_account_name>",
    account_key="<your_account_key>",
)

For any other service provider, it is possible to implement your custom importer class, here's how. Don't forget that GQLAlchemy is an open source project, so you can submit your extended functionality on our GitHub repository.

8. Querying the database and retrieving results

Let's make sure that our data was imported correctly by retrieving it:

results = memgraph.execute_and_fetch(
    """
    MATCH (n) RETURN n;
    """
)

results = list(results)

for result in results:
    print(result["n"])

If you want to do more with your graph data, visit the Memgraph MAGE graph library (and throw us a star ⭐) and take a look at all of the graph algorithms that have been implemented. You can also implement and submit your own algorithms and utility procedures.

10. Visualizing the graph in Memgraph Lab

Open Memgraph Lab in your browser on the address localhost:3000. Execute the following Cypher query:

MATCH (n)-[r]-()
RETURN n, r;

memgraph-table-to-graph-import

Image 1. Memgraph Lab visualization of the graph structure

What's next?

Now it's time for you to import your data into Memgraph and start using it!

You can always check out Memgraph Playground for some cool use cases and examples. If you have any questions or want to share your work with the rest of the community, join our Discord Server.

Read more about Python and graph databases on memgraph.com