
Distroid: Graph Data Model for Catalogue

Published on Apr 11, 2024

Here is our early work on developing a graph data model for the catalogue.

Graph Data Model

Relations Table

Nodes Table

Visualization

You can find our visualization of the graph data model below.

Visualization of graph data model.

Unfortunately, we could not get the edge labels to render this time. We will fix it in a later version.

Example Graph

I made an example graph, based on the graph data model, in Neo4j.

I included the following entities:

  1. Person,

  2. Role,

  3. Works,

  4. Identifier,

  5. Keyword,

  6. Collection, and

  7. MediaSource.

I included the following relationships:

  1. HAS_AUTHORED,

  2. LOCATED_IN,

  3. HAS_PUBLISHED_IN,

  4. AUTHORED_WITH,

  5. HAS_IDENTIFIER,

  6. HAS_KEYWORD,

  7. HAS_LANGUAGE,

  8. HAS_EVENT,

  9. REFERENCES_MARKER,

  10. HAS_USED_MARKER,

  11. CURATED_IN, and

  12. HAS_ROLE.

The data for the example graph was collected from:

  1. DAO Times,

  2. Reboot,

  3. Rehash: A Web3 Podcast,

  4. Green Pill Podcast, and

  5. Rest of World.

You can find more information on the sources above in the table below.

For the next update, I will probably add Language and Sensemaking Marker nodes to the graph.

Visualizations

Overall

Sensemaking Markers and Events

Early Takeaways

Sensemaking Markers and Events

There were not many issues here regarding the Sensemaking Marker.

The marker (Informative) was not too difficult to add because I could refer to the sensemaking marker registry.

Adding the Event was more difficult, both because I had to decide how to define an Event and because I needed to make a sample dataset to describe one (with fields for the name of the event, the creator of the event, the work associated with the event, and the markers associated with the event).

The main issue I had was how to define an event associated with a marker. I initially thought I could model it as Person-Event-Work (here, Charles Adjovu - Rated_Informative -> "ShopeeFood miscalculates delivery distances. Gig workers in Vietnam are paying the price"), with the marker details as properties. I then realized this could cause problems: we would need to define the marker with the same details quite often, and it would be hard to traverse the graph based solely on how people and organizations are making sense of the works. So I added Markers as nodes to make this task easier.
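To make the trade-off concrete, here is a minimal Python sketch of the Marker-as-node approach, using plain (source, relationship, target) triples rather than Neo4j. The node keys, the event ID, the second work, and the edge directions are all assumptions for illustration; only the relationship names come from the list of relationships above.

```python
# Edges as (source, relationship, target) triples.
# Node keys and edge directions are illustrative assumptions,
# not the catalogue's actual schema.
EDGES = [
    ("work:shopeefood-article", "HAS_EVENT", "event:rating-1"),
    ("event:rating-1", "REFERENCES_MARKER", "marker:informative"),
    ("person:charles-adjovu", "HAS_USED_MARKER", "marker:informative"),
    # Hypothetical second work with no events, so the filter has something to exclude.
    ("work:unrated-article", "HAS_KEYWORD", "keyword:example"),
]


def works_with_marker(edges, marker):
    """Return the works that have at least one event referencing `marker`."""
    # Step 1: find every event that points at the marker node.
    events = {src for src, rel, dst in edges
              if rel == "REFERENCES_MARKER" and dst == marker}
    # Step 2: find every work that points at one of those events.
    return sorted(src for src, rel, dst in edges
                  if rel == "HAS_EVENT" and dst in events)


print(works_with_marker(EDGES, "marker:informative"))
```

Because the marker is defined once as a node, the lookup starts at the marker and follows edges outward, without repeating the marker's details on every relationship.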

I am wondering whether Event should have a type system as well (Rating, Review, etc.). Additionally, I am thinking about renaming the relationship from HAS_EVENT to ATTENTION_ON to emphasize that Events are nodes describing how people or organizations are paying attention to a Work.

Lastly, adding the event also required adding a new role, Commentator. I originally planned for this role to include any person or organization that commentates on a work, but I may need to further delineate this role into additional roles, such as Rater, Reviewer, and Annotator (if only applying a label to the work).

Language

I did not have any issues adding a Language node (here, EN for English) to the graph.

However, I may need to add a property to the relationship for the dialect (or add new nodes for each dialect).

I did not use any language detection tools this time around, but I plan to use them in the future to speed up language detection.

I had trouble using the Identifier label because, early on, I could not think of unique names for Work and Person identifiers.

I decided to go with the following nomenclature for identifiers: "space/item_type/id".

Here is an example for a YouTube video: "youtube/video/2feau76NG5c".

Here is an example for a Twitter profile: "twitter/profile/owocki".

I do not expect that platforms will duplicate IDs often, so I think this nomenclature is a safe bet for identifiers.
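As a rough illustration of the nomenclature, the following Python sketch builds and parses "space/item_type/id" strings. The `Identifier` class and `parse_identifier` function are hypothetical names introduced here, and splitting on only the first two slashes is an assumption about how IDs containing "/" should be handled.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    space: str      # platform or namespace, e.g. "youtube"
    item_type: str  # kind of item within that space, e.g. "video"
    id: str         # the platform's own ID for the item

    def __str__(self) -> str:
        return f"{self.space}/{self.item_type}/{self.id}"


def parse_identifier(text: str) -> Identifier:
    """Parse a 'space/item_type/id' string into its three parts."""
    # Split on the first two slashes only, so an id containing "/" is kept whole.
    parts = text.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'space/item_type/id', got {text!r}")
    return Identifier(*parts)


video = parse_identifier("youtube/video/2feau76NG5c")
profile = Identifier("twitter", "profile", "owocki")
print(video.space, video.item_type, video.id)
print(str(profile))
```

Keeping the three parts separate also makes it possible to query by space or item type later, instead of matching on string prefixes.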

Nodes that have a HAS_IDENTIFIER relationship.

Adding Platform (i.e., service) nodes might also address this issue.

Unique IDs for Person nodes

I also realized that there will naturally be an issue with identifying Person nodes, because a single ID may be insufficient to uniquely identify a person. I think a possible work-around is to hash the Person's social profiles and name with a hashing function, and use the hash as an identifier.

I think this would be beneficial for Works as well.

An alternative is to rely more on knowledge bases for entity linking, or simply to hash the IDs that a knowledge base provides for an entity.
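The hashing work-around described above could be sketched as follows. This is only one possible approach: the choice of SHA-256, the normalisation rules, the function name `person_hash`, and the example profile strings are all assumptions, not part of the data model.

```python
import hashlib


def person_hash(name: str, profile_ids: list[str]) -> str:
    """Derive a stable identifier from a person's name and social profiles."""
    # Normalise the name so case and extra whitespace do not change the hash.
    normalised_name = " ".join(name.lower().split())
    # De-duplicate and sort profiles so input order does not change the hash.
    # Profile IDs are not lowercased, since some platforms use case-sensitive IDs.
    normalised_profiles = sorted({p.strip() for p in profile_ids})
    payload = "\n".join([normalised_name, *normalised_profiles])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical profile strings, following the "space/item_type/id" nomenclature.
profiles = ["twitter/profile/example_handle", "github/profile/example_handle"]
print(person_hash("Charles Adjovu", profiles))
```

One caveat of this design is that the hash changes whenever a profile is added or removed, so the set of inputs would need to be fixed (or the old hashes kept as aliases) for the identifier to stay stable.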

Classification system for things

While trying to address the issue of unique names for identifiers, I also realized that I probably need to add a Thing label for nodes, to help classify what kind of thing (or object) a node (or relationship) represents.

For example, an Identifier can be considered a Thing, and the Identifier label can then be replaced with the name of the identifier (e.g., TwitterProfile), with the ID set as the name (e.g., owocki).

I think this could help address the Identifier naming issue, and make Works easier to search for based on their item type (e.g., video, audio).

Time consumption

I felt that adding all of these nodes (and primarily, the Identifier nodes) took too much time.

It may be best to first start with the following nodes to speed up populating the graph:

  1. Works,

  2. Person, and

  3. MediaSource.
