Catalogue KG Prototype for Graph Data Model V0.2
In this pub, I document my work on creating a prototype Knowledge Graph (KG) (“Prototype”) that conforms to my iterative updates to the Catalogue’s Graph Data Model V0.2 (“Model”).1
I go over my process for creating the Prototype, statistics and visuals of the Prototype, and my reflections on developing the Prototype.
| Term | Definition |
| --- | --- |
| Prototype | A prototype Knowledge Graph (KG) that conforms to the Model |
| Model | The Catalogue’s Graph Data Model V0.2 |
I ingested 44,785 Works (including text, audio, and video media types) from 352 Media Sources in the Media Sources directory.
You can find the Media Sources directory below.
I used Feedparser, YouTube API, and Podchaser to collect Works from Media Sources.
For Works collected via Feedparser, I may not have collected all of the Works from a particular Media Source because the RSS feed was missing some of the Works.
After collecting the Works, I examined the metadata to find contributor information and add it to the Works dataset.
I then transformed the Works and associated metadata to conform to the Model by creating new datasets for entities and relationships, primarily with Pandas. As part of the transformation, I used the GLiNER library to conduct named entity recognition (NER) on the contributors to a Work, to classify them as a Person, Organization, or Unclassified.
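The Pandas transformation step can be sketched as splitting the flat Works dataset into entity tables and relationship tables. The column names below (`author`, `source`, etc.) are illustrative assumptions, not the actual schema:

```python
import pandas as pd

# Illustrative raw Works dataset; the real columns differ.
works = pd.DataFrame({
    "workURL": ["https://example.com/a", "https://example.com/b"],
    "title": ["Work A", "Work B"],
    "author": ["Alice", "Acme Media"],
    "source": ["Feed One", "Feed One"],
})

# Entity datasets: one row per unique entity.
work_entities = works[["workURL", "title"]].drop_duplicates()
contributor_entities = works[["author"]].drop_duplicates()
mediasource_entities = works[["source"]].drop_duplicates()

# Relationship datasets: (start, end) pairs per relationship type.
contributed_to = works[["author", "workURL"]].rename(
    columns={"author": "start", "workURL": "end"})
published_in = works[["workURL", "source"]].rename(
    columns={"workURL": "start", "source": "end"})
```

Each resulting frame maps onto one entity label or relationship type in the Model, which keeps the later load step mechanical.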
After I transformed the data to conform with the Model, I loaded the data into an AuraDB instance, a cloud graph database service offered by Neo4j. AuraDB provides access to a browser for querying and visualizing the graph database.
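A sketch of the load step, assuming the official `neo4j` Python driver and an AuraDB connection URI. The credentials and property names are placeholders; the pub does not describe the exact load code:

```python
def merge_entity_query(label, props):
    """Build a parameterized Cypher MERGE for one entity type,
    keyed on jsonld_id and run with UNWIND over a batch of rows."""
    setters = ", ".join(f"n.{p} = row.{p}" for p in props)
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{jsonld_id: row.jsonld_id}}) "
        f"SET {setters}"
    )

def load_entities(uri, user, password, label, rows, props):
    # Imported lazily so the query builder above can be used
    # without the driver installed.
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        driver.execute_query(merge_entity_query(label, props), rows=rows)

query = merge_entity_query("Work", ["title", "workURL"])
```

Batching with `UNWIND` and merging on a stable key is one common pattern for idempotent loads; rerunning it should not create duplicate entities.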
| Area | Count |
| --- | --- |
| Number of Entities | 46,848 |
| Number of Relationships | 104,929 |
| Types of Entities | 5 |
| Number of Properties Set | 39 |
| Types of Relationships | 8 |
| Entity | Count |
| --- | --- |
| Person | 1,576 |
| Work | 44,785 |
| Organization | 132 |
| Unclassified | 3 |
| MediaSource | 352 |
| Relationship | Count |
| --- | --- |
| DISTRIBUTES | 89,570 |
| PUBLISHED_IN | 89,570 |
| UNCLASSIFIED_CONTRIBUTION_IN | 142 |
| GUEST_APPEARANCE_IN | 466 |
| OWNS_OR_CONTROLS | 704 |
| WORKS_PUBLISHED_IN | 2,944 |
| COLLABORATED_WITH | 4,404 |
| CONTRIBUTED_TO | 22,058 |
If you would like to access the Prototype on AuraDB, please send an email to [email protected] with RE: Access to Catalogue KG Prototype on AuraDB in the subject line.
Trying to get authorship for Works was tiring at times. Some of the tools I used included authorship (and more broadly, contributor) information in the output (specifically, the person or organization that created the Work, rather than simply the owner or controller of the MediaSource), while others did not. For sources missing authorship information, I used Requests, Beautiful Soup 4, and Newspaper3k to scrape the content and extract it. In the case of Mirror, despite MediaSources on the Mirror platform having RSS feeds, those feeds were empty more often than not, so I had to use a combination of the aforementioned web scraping tools to extract contributor information, dates, and other metadata.
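The extraction step boils down to pulling author metadata out of the page HTML. The following is a simplified, standard-library stand-in for what Beautiful Soup 4 or Newspaper3k does; the `<meta name="author">` convention it relies on is common but by no means universal:

```python
from html.parser import HTMLParser

class AuthorMetaParser(HTMLParser):
    """Collect <meta name="author" content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "author" and attrs.get("content"):
            self.authors.append(attrs["content"])

# Hypothetical page snippet for illustration.
page = '<html><head><meta name="author" content="Jane Doe"></head></html>'
parser = AuthorMetaParser()
parser.feed(page)
```

Real pages often put authorship in JSON-LD or OpenGraph tags instead, which is part of why a mix of scraping tools was needed.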
Some of the Media Sources had relocated their Works to new sites, rendering the original Works inaccessible. For example, OurNetwork migrated the content of their Works from Substack to their own Ghost instance. Even though I could still collect data from their Substack via Feedparser, I could not access the content of OurNetwork’s Works because OurNetwork had edited their Works on Substack to remove the substantive content. This led to multiple copies of the same Work, with one version (Ghost) having more substantive content than the other (Substack).
I noticed that I could not find the workURL for some of the Works collected with Feedparser. For example, the On the Other Side podcast has an RSS feed containing information on published Works. I examined my collected Works dataset and realized that certain Works from On the Other Side were missing their workURL. I opened the RSS feed in the browser to check for a discrepancy between the information on the RSS feed webpage and the information collected via Feedparser. As it turns out, the Works on the RSS feed webpage were themselves missing the workURL.
I found out that Unchained Crypto has three podcasts, which are all published on its YouTube channel. After investigating its YouTube channel, I noticed that there were podcast-specific playlists. So I had to recollect Works from Unchained Crypto from each podcast-specific playlist, and remove any Works I collected that were not in these playlists. I did the same for DAO Talk, after finding out that the Tally YouTube channel had a specific playlist for DAO Talk. Lastly, I had to add two MediaSources, Bits + Bips and The Chopping Block, to the Media Sources directory.
Thus, I realized that when collecting Works from YouTube, I should collect Works per playlist, rather than collecting all Works from the YouTube channel’s uploads playlist.
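The playlist-scoped approach above amounts to keeping only the intersection of a channel's uploads with its podcast-specific playlists. A minimal sketch (the video IDs are made up, and fetching the uploads and playlist memberships via the YouTube API is assumed to have already happened):

```python
def works_from_playlists(uploads, playlists):
    """Keep only uploaded videos that appear in at least one
    podcast-specific playlist; everything else is dropped."""
    in_playlists = set().union(*playlists.values()) if playlists else set()
    return uploads & in_playlists

# Hypothetical channel data: four uploads, three covered by playlists.
uploads = {"vid1", "vid2", "vid3", "vid4"}
playlists = {
    "Unchained": {"vid1", "vid2"},
    "Bits + Bips": {"vid3"},
}
kept = works_from_playlists(uploads, playlists)  # vid4 is dropped
```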
I found it difficult to determine the entity type of contributors for Works.
For example, I tried to determine the author of The Codeless Conduct Substack newsletter from The Codeless Alpha.
The Codeless Alpha, though it has a Substack account, is actually an event, and thus cannot be a contributor to a Work. So, I chose to replace The Codeless Alpha with Nichanan Kesonpat, the other account listed under the People heading on the Substack newsletter.
A similar situation also arose with Amy Zhang and PolicyKit (a technology).
This led me to make some editorial decisions to remove the non-human entities from the database and replace them with the Person associated with the technology or event.2
Given the automated nature of entity recognition with GLiNER, I assume that there are some entities in the Prototype that are mislabeled, or that do not fall under any entity type described in the Graph Data Model V0.2.
After I conducted NER with GLiNER on authors and guests associated with Works, I assumed or trusted that any entity labeled as a Person or Organization was correct3, and focused on determining if entities with other labels were a Person or Organization.
An additional issue that arose was that some Person entities had only a single name (a first name, username, or pseudonym), making it very hard to determine who the person is, or whether they are a person at all.
I had to remake the graph multiple times because I would change the name of a relationship or entity to comply with my updates to the Graph Data Model V0.2.
I decided not to include other entities for this Prototype. Though, I think when I update this Prototype or make a new prototype, I will focus only on Work, MediaSource, and Identifier entities because trying to identify if an entity is a Person or Organization, and finding the entity’s CONTRIBUTED_TO and OWNS_OR_CONTROLS relationships, can be very time-consuming.
I think the WORKS_PUBLISHED_IN, UNCLASSIFIED_CONTRIBUTION_IN, and GUEST_APPEARANCE_IN relationships may be redundant because you can find the same information by querying the CONTRIBUTED_TO and PUBLISHED_IN relationships.
The DISTRIBUTES relationship may also be redundant because you can also find the relationship by querying for the PUBLISHED_IN relationship.
Let me know your thoughts on whether the WORKS_PUBLISHED_IN, UNCLASSIFIED_CONTRIBUTION_IN, and GUEST_APPEARANCE_IN relationships are redundant.
I was considering using Author as a shorthand or intermediate label for entities before labeling them as a Person or Organization. I chose not to create an Author label for this prototype because I could use GLiNER for NER, and instead created an Unclassified entity label. I used the Unclassified entity type for entities that GLiNER did not label as a Person or Organization, and that I could not label as either after reviewing GLiNER’s initial labeling (e.g., TBA or Admin).
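The post-GLiNER review step can be sketched as a small normalization pass. The label strings and the manual-review fallback below are assumptions about the pipeline, not its actual code:

```python
def normalize_label(gliner_label, reviewed_label=None):
    """Map a GLiNER label to the Model's entity types.

    Person/Organization labels from GLiNER are trusted as-is;
    anything else falls back to a manual review decision, and
    finally to Unclassified (e.g., "TBA" or "Admin")."""
    if gliner_label in ("Person", "Organization"):
        return gliner_label
    if reviewed_label in ("Person", "Organization"):
        return reviewed_label
    return "Unclassified"

# A review decision overrides an unhelpful GLiNER label...
resolved = normalize_label("Event", reviewed_label="Person")
# ...but with no decision, the entity stays Unclassified.
unresolved = normalize_label("Other")
```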
Another issue I ran into was how to save properties that have an object as a value.
For the CONTRIBUTED_TO relationship, the relationship has two properties, contributionType and contributionTaxonomy, that are interrelated.
I wanted to save these properties as an object (a dictionary or JSON), but Neo4j does not allow dictionaries or JSON objects as property values, so I saved each as its own property rather than as a single property.
I think next time, I will serialize the pair as a JSON string and save it as one property.
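Since Neo4j properties cannot hold maps, the single-property approach described above would serialize the pair with `json.dumps` and parse it back on read. The taxonomy URL here is a made-up placeholder:

```python
import json

contribution = {
    "contributionType": "author",
    "contributionTaxonomy": "https://example.com/taxonomy",  # placeholder
}

# Store the interrelated pair as one string property...
prop_value = json.dumps(contribution)

# ...and recover the object when reading it back from the graph.
restored = json.loads(prop_value)
```

The trade-off is that Cypher cannot filter on the inner keys directly; queries would have to parse or string-match the serialized value.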
I wanted to populate the graph as much as possible, even if that meant sacrificing data quality for this prototype, because I would rather prune the database later than conduct quality checks at the beginning. Conducting quality checks can be time-intensive4, and for this Prototype, I wanted to focus more on populating the database and showing how it can be useful in finding digital media.
An additional issue I ran into was that certain Works were probably better classified as anthologies, rather than as individual works.
For example, ON-257: Onchain Culture 🌐 is a Work that is a collection of analyses by various analysts.
In the future, I plan to include these sub-works in the Model so I can also better reflect how people and organizations are contributing to Works, and make these sub-works discoverable.
I could not load the JSON-LD fields (@context and @id) directly into Neo4j because of the at sign (@) at the beginning of the fields (Neo4j does not allow property keys to start with non-alphanumeric characters), so instead, I named these fields in the entity and relationship properties as jsonld_context and jsonld_id.
It should be very easy to convert these fields into proper JSON-LD formatting in the future, if needed.
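That conversion is just a key rename; a minimal sketch, assuming the property dictionaries come straight out of the graph:

```python
def to_jsonld(props):
    """Restore the JSON-LD key names that Neo4j property keys could
    not carry (jsonld_context -> @context, jsonld_id -> @id)."""
    renames = {"jsonld_context": "@context", "jsonld_id": "@id"}
    return {renames.get(key, key): value for key, value in props.items()}

# Hypothetical node properties as they would be stored in Neo4j.
node = {
    "jsonld_id": "urn:work:1",
    "jsonld_context": "https://schema.org",
    "title": "Demo Work",
}
jsonld = to_jsonld(node)
```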
I am contemplating whether to change the MediaSource label to Publication.
As I was examining the content of some of the Works, I noticed that some Works mentioned contributors other than the author. For example, in We're All Marxists Now: The Paradoxes of Self-Sovereignty, the authors mention that they received support (i.e., contributions) from Miru, Tasha Kim, and Tom White. Additionally, many podcast guests were mentioned in the Works (some as authors, others in the Work’s description), and I wanted to include them in the Prototype as well.
I then decided to change HAS_AUTHORED and GUEST_ON from relationships into values of the contributionType property on the CONTRIBUTED_TO relationship.
The CONTRIBUTED_TO relationship is broad enough to encompass any contribution to a Work, and can allow for multiple contribution taxonomies (those developed by the MediaSource or by a third party) to be referenced in the Prototype.
I believe focusing on contributors, rather than solely authorship, can better reflect the nature of how people and organizations work together to produce a Work, and help acknowledge contributions which are somewhat neglected or often not mentioned.
I found that developing the Prototype was extremely helpful in learning the skills and knowledge needed to overcome obstacles in building a data pipeline, and in practically applying the data model and determining what updates it needs.
I am seeking feedback on this pub for any improvements to make, errors to correct, or other areas to explore.
Please leave your feedback here, on the Ledgerback discussion forum, or on Twitter.