Take Home Exercise 3
Mini Challenge 3 of VAST Challenge 2023

```r
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork,
               tidytext, tidyverse, RColorBrewer, wordcloud2, DT,
               treemap, knitr, ggiraph, hrbrthemes, scales, igraph)
options(scipen = 999)
```
1. OVERVIEW
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation's database on fishing-related companies. FishEye has transformed the database into a knowledge graph which includes companies, owners, workers, and financial status.
1.1 The Task
In this take-home exercise, we identify anomalies in the business groups using the knowledge graph that FishEye created from the database. In addition, we apply a visual analytics process to find similar businesses by focusing on each business's most important features.
2. Datasets
The trade data is stored in the MC3.json file. There are a total of 27,622 nodes and 24,036 edges in the knowledge graph.
2.1 Metadata
| Location | Variable Name | Description |
|---|---|---|
| Node, Edge | id | Identifier of the node |
| Node | country | Country associated with the entity |
| Node, Edge | type | Type of node; can be a person or a company |
| Node | revenue_omu | Operating revenue of the id node in Oceanus Monetary Units |
| Node | product_services | Description of the products and services that the id node provides |
| Edge | source | ID of the source node |
| Edge | target | ID of the target node |
3. Data Preparation
3.1 Install R-packages
Using p_load() of the pacman package to install and load the following libraries:

- jsonlite: To import data from a JSON file into R
- tidygraph: For graph manipulation
- ggraph: For visualizing graphs and networks
- visNetwork: For network visualization
- tidytext: To tidy, transform and analyze text data
- tidyverse: A collection of R packages used in everyday data analysis, supporting data science, data wrangling and analysis
- RColorBrewer: Ready-to-use color palettes for creating visually appealing plots
- wordcloud2: For creating word clouds, a visual representation of word frequency
- DT: For creating interactive data tables, with functions to display, filter and sort data for interactive exploration and analysis
- treemap: For visualizing hierarchical data using nested rectangles
- knitr: For dynamic report generation
- ggiraph: For creating interactive ggplot2 graphics
- hrbrthemes: Additional themes and utilities for ggplot2
- scales: For customizing the appearance of axis and legend labels
- igraph: For exploring and analyzing networks
options(scipen = 999) removes scientific notation in our exercise.
3.2 Importing Data
The JSON file is imported into R with the fromJSON() function from the jsonlite package. The code chunk below imports the knowledge graph that FishEye transformed from the database provided by an international finance corporation on fishing-related companies.
```r
MC3_challenge <- fromJSON("data/MC3.json")
```

3.2.1 Extracting Edges
As the imported data file is a large list, we will extract the edges from MC3_challenge and save it as a tibble data frame called MC3_edges. The code is extracted in the following manner:
- distinct() is used to remove duplicated records
- mutate() and as.character() are used to convert field data types from list to character
- group_by() and summarise() are used to count the number of unique links
- filter(source != target) is used to ensure that source and target companies are not identical
```r
MC3_edges <- as_tibble(MC3_challenge$links) %>%
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source != target) %>%
  ungroup()
```

3.2.2 Extracting Nodes
Similarly, we will extract the nodes from MC3_challenge and save it as a tibble data frame called MC3_nodes. The code is extracted in the following manner:
- mutate() and as.character() are used to convert data types from list to character
- as.numeric(as.character()) is used to convert revenue_omu from list to character before converting it to a numeric data type
- select() is used to reorganize the column sequence
```r
MC3_nodes <- as_tibble(MC3_challenge$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
```

4. Creating a Master ID data frame
We would like to create a master ID data frame from the knowledge graph, since there might be source or target nodes that are not reflected in MC3_nodes. The code below selects the source and target columns individually and concatenates them through rbind(). Thereafter, left_join() is used to append the attributes from the MC3_nodes data frame.
Additionally, we rename id to label and create a new id column through mutate() and nrow(). Similarly, we rename source and target to sourcelabel and targetlabel respectively, and use left_join() to append the id before renaming it.
Show the code
```r
# default masterlist
id1 <- MC3_edges %>%
  select(source) %>%
  rename(id = source)
id2 <- MC3_edges %>%
  select(target) %>%
  rename(id = target)
MC3_nodes_master <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes,
            unmatched = "drop")

# create new node df to include id number
MC3_nodes_Masterlist <- MC3_nodes_master %>%
  select(id) %>%
  distinct() %>%
  rename(label = id) %>%
  ungroup()

# add ID to nodes dataframe
MC3_masternodes <- MC3_nodes_Masterlist %>%
  mutate(id = as.character(1:nrow(MC3_nodes_Masterlist))) %>%
  relocate(id, label) %>%
  ungroup()

# append corresponding id through left_join
MC3_edges_addID <- MC3_edges %>%
  rename(sourcelabel = source, targetlabel = target) %>%
  left_join(MC3_masternodes, by = c("sourcelabel" = "label")) %>%
  rename(source = id) %>%
  left_join(MC3_masternodes, by = c("targetlabel" = "label")) %>%
  rename(target = id) %>%
  relocate(source, target)

# output for dataframe using knitr::kable
kable(head(MC3_masternodes), "simple")
```

| id | label |
|---|---|
| 1 | 1 AS Marine sanctuary |
| 2 | 1 Ltd. Liability Co Cargo |
| 3 | 1 S.A. de C.V. |
| 4 | 1 and Sagl Forwading |
| 5 | 2 Limited Liability Company |
| 6 | 2 S.A. de C.V. |
5. Exploratory Data Visualization
In this section, we design interactive plots for users to explore the data. The plots are created with the ggiraph and treemap packages. We examine the distribution by type for nodes and edges, the distribution of revenue, and the distribution by country broken down by type.
A bar chart is chosen to compare segments of information across categorical variables. Aggregation is performed before creating the interactive graph. A tooltip is used to highlight the count of the data, while scale_fill_manual() is used to standardize the colors across graphs.
Note: The Treemap shows the Top 10 countries.
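The interactive bar charts in this section come from folded code chunks; the sketch below shows the tooltip + scale_fill_manual() approach on a toy data frame with hypothetical counts (the names and values are not the actual MC3 aggregates):

```r
library(ggplot2)
library(ggiraph)

# Toy counts standing in for the aggregated node/edge types (hypothetical values)
type_counts <- data.frame(
  type  = c("Company", "Beneficial Owner", "Company Contacts"),
  count = c(100, 60, 40)
)

# Interactive bar chart: hover shows the count, colors are fixed per type
p <- ggplot(type_counts,
            aes(x = reorder(type, -count), y = count, fill = type)) +
  geom_col_interactive(aes(tooltip = count, data_id = type)) +
  scale_fill_manual(values = c("#1b9e77", "#d95f02", "#7570b3")) +
  labs(x = "Type", y = "Count")

girafe(ggobj = p)
```

girafe() wraps the ggplot object into an htmlwidget so the tooltip works in the rendered page.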
5.1 Type & Revenue
Observations:
Distribution by Type: There are three categories in Nodes and two categories in Edges. We observe that Edges contain only Beneficial Owner and Company Contacts, whereas Company information is only available in Nodes. Therefore, we can infer that the type in Edges refers to the connection/ownership that people are associated with.
Distribution by Revenue: The distribution is right-skewed, with the highest proportion in the $10k to $100k bin, followed by revenue below $10k. The counts decrease as the revenue bins increase. There are eight IDs that earn more than $100M in revenue. However, more research is needed to determine whether the revenue was accumulated over time or within a specific period, and how it relates to each company's strength, size, or selling proposition.
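The revenue binning behind the distribution above can be sketched with cut(); the revenue values here are made up for illustration and the break points are an assumption based on the bins mentioned in the text:

```r
# Toy revenue values (hypothetical, not the actual MC3 figures)
revenue <- c(5e3, 5e4, 2e5, 8e4, 1.5e8, 3e4)

# Bin revenue into the ranges discussed above
bins <- cut(revenue,
            breaks = c(0, 1e4, 1e5, 1e6, 1e7, 1e8, Inf),
            labels = c("<$10k", "$10k-$100k", "$100k-$1M",
                       "$1M-$10M", "$10M-$100M", ">$100M"))
table(bins)
```

The resulting table of bin counts is what a histogram or bar chart of revenue would display.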
5.2 Country
As we have previously installed the package through devtools, install_github() does not need to be run again; we only load the d3treeR library. Note: the GitHub package only needs to be installed once.
```r
# install_github("timelyportfolio/d3treeR")
library(d3treeR)
```

The d3tree() function from the d3treeR package is used to build our interactive treemap.
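The e5 object passed to d3tree() comes from a folded code chunk; a minimal sketch of how such a treemap object can be built, using hypothetical country/type counts rather than the actual aggregates:

```r
library(treemap)

# Toy aggregate standing in for the country/type counts (hypothetical values)
by_country <- data.frame(
  country = c("ZH", "ZH", "Oceanus", "Marebak"),
  type    = c("Beneficial Owner", "Company", "Company", "Company"),
  n       = c(40, 30, 25, 10)
)

# treemap() draws the chart and invisibly returns an object
# that d3tree() can convert into an interactive widget
e5 <- treemap(by_country,
              index = c("country", "type"),
              vSize = "n",
              palette = "Blues")
```

Passing the returned object to d3tree() then produces the drill-down treemap shown below.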
```r
d3tree(e5, rootname = "Categories by Country")
```

Observation:
- Distribution by Country: Among the countries, ZH has the largest size and contains the darkest gradient, which belongs to Beneficial Owner. Meanwhile, the rest of the countries only have information about the Company.
6. Network Visualization and Analysis
Based on the distribution by type, we would like to explore the network distribution to identify anomalies. We create two separate networks to focus on Beneficial Owner and Company Contacts individually. We compute the centrality of the nodes using the graph_from_data_frame() function from the igraph package.
We use Betweenness Centrality in our network to identify nodes that may have considerable influence, instead of other centrality measures. visNetwork is used to plot the interactive network graph with the Fruchterman-Reingold layout.
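The network construction described above can be sketched as follows; the edge list is a toy stand-in for the MC3 edges, not the actual data:

```r
library(igraph)
library(visNetwork)

# Toy edge list standing in for the MC3 edges (hypothetical ids)
edges <- data.frame(from = c("1", "1", "2", "3"),
                    to   = c("2", "3", "4", "4"))

# Build the graph and compute betweenness centrality
g <- graph_from_data_frame(edges, directed = FALSE)
nodes <- data.frame(id    = V(g)$name,
                    value = betweenness(g))  # "value" scales node size

# Interactive network with the Fruchterman-Reingold layout
visNetwork(nodes, edges) %>%
  visIgraphLayout(layout = "layout_with_fr")
```

Mapping betweenness onto the node size makes high-influence intermediaries stand out visually.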
Observations:
The majority of Beneficial Owners have exclusivity or a partnership of 1-2 owners per company. However, we identified companies such as Dutch Oyster Sagl Cruise ship, Ola de la Costa N.V., Niger Bend AS Express, and Bahia de Plata Submarine with relatively high ownership.
Similarly, most Company Contacts maintain exclusivity or a partnership with the company they liaise with. As observed, there is one company that maintains a high number of contacts. In addition, there are personnel who form multiple connections with companies. For instance, Jennifer Johnson has 5 connections, David Smith has 4 connections, and John Williams has 3 connections.
6.1 Identify Connections
From the network visualization, we noticed that there are people with high ownership/connections. Hence, a bar chart is plotted to better visualize their connections. Since we are identifying anomalies, we aggregated the data and filtered away exclusive relationships with a count equal to one.
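The aggregation and filtering described above can be sketched as follows, with a toy edge list standing in for MC3_edges (the actual aggregate names such as MC3_edges_bc_agg come from folded chunks):

```r
library(dplyr)

# Toy edges standing in for MC3_edges (hypothetical values)
edges <- data.frame(
  source = c("A", "A", "B", "C", "C", "C"),
  to     = c("p1", "p2", "p1", "p1", "p3", "p4"),
  type   = "Beneficial Owner"
)

# Count connections per person, then drop exclusive (count == 1) links
edges_agg <- edges %>%
  group_by(to, type) %>%
  summarise(count = n(), .groups = "drop") %>%
  filter(count > 1)
```

Only p1, who is linked to three companies, survives the filter; the real analysis plots these counts as a bar chart.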
Observation:
- As the number of ownerships/connections increases, the frequency decreases. Hence, to detect anomalies, we look into Beneficial Owners with 4 or more companies, and Company Contacts with 3 or more.
6.2 Identify Personnel
To identify the personnel with high ownership/connections, we filter() them based on the criteria above and concatenate the results through rbind(). Thereafter, we use distinct() to ensure that there are no duplicate IDs, and left_join() to append fields from the MC3_edges_addID data frame.
With the finalized table, datatable() from the DT package is used to display MC3_anomalies as an interactive table with search and sorting features.
Show the code
```r
# filter for contacts with high connections
MC3_edges_cc <- MC3_edges_cc_agg %>%
  filter(count >= 3) %>%
  ungroup()

# filter for owners with high ownership
MC3_edges_bc <- MC3_edges_bc_agg %>%
  filter(count >= 4) %>%
  ungroup()

# combine the ids in both lists and merge with edges
MC3_anomalies <- rbind(MC3_edges_cc, MC3_edges_bc) %>%
  distinct() %>%
  left_join(MC3_edges_addID, by = c("to" = "target"), unmatched = "drop") %>%
  rename(target = to) %>%
  select(sourcelabel, targetlabel, type) %>%
  ungroup()

# output for dataframe using DT::datatable
DT::datatable(MC3_anomalies)
```

7. Product and Services
Since we are unable to gain insights based on revenue alone, we look into the business's most important feature - the products and services offered. Given a unique selling point, a product/service determines the prospects of the business.
To begin, we perform text mining using the tidytext package. We use tokenisation to break the given text into units called tokens. A unit could be an individual word, a phrase, or an entire sentence. Thereafter, unnest_tokens() is used: the text extracted from product_services goes into the output column word.
Show the code
```r
token_nodes <- MC3_nodes %>%
  unnest_tokens(word, product_services)

# output for dataframe using knitr::kable
kable(head(token_nodes), "simple")
```

| id | country | type | revenue_omu | word |
|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | automobiles |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | passenger |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | cars |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | trucks |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | vans |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | and |
From the table above, we notice that there are words that might not be useful for our analysis, e.g. and in row 6. Therefore, we would like to exclude such words. In the code chunk below, the stop_words dataset from the tidytext package aids the filtering process, while anti_join() from the dplyr package removes all stop words from the analysis.
Show the code
```r
stopwords_removed <- token_nodes %>%
  anti_join(stop_words)

# output for dataframe using knitr::kable
kable(head(stopwords_removed), "simple")
```

| id | country | type | revenue_omu | word |
|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | automobiles |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | passenger |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | cars |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | trucks |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | vans |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | buses |
As observed, the word and has been removed from the analysis; row 6 now shows buses instead. Even so, it seems the extracted words could be further grouped into broader categories.
7.1. Word Frequency
To get a better graphical representation of word frequency, a word cloud is used to give greater prominence to words that appear more frequently in the source text. Generic words such as character, 0, unknown, including and related that do not bring insights are removed from the frequency count. Thereafter, we create the data frame stopwords_removed_freq through the following steps:
- filter(!) to remove generic words
- group_by() word and summarize() by counting the frequency
- arrange(desc()) to sort in descending order
The wordcloud2 package is used for easy and quick insights into the text. For reproducibility, set.seed() is incorporated.
Show the code
```r
# remove generic words
remove_characters <- c("character", "0", "unknown", "products", "services",
                       "including", "source", "offers", "range", "related")

# create dataframe of each word with frequency
stopwords_removed_freq <- stopwords_removed %>%
  filter(!word %in% remove_characters) %>%
  group_by(word) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()

# generate word cloud
set.seed(1234) # for reproducibility
wordcloud2(data = stopwords_removed_freq, size = 1.6, color = 'random-dark')
```

Observations:
Fish remains the top keyword in the word cloud. However, there are multiple words that are not grouped within the fish category (e.g. Shellfish, Tuna, Crab, Squid).
Given that the knowledge graph should focus on fishing-related companies, the word cloud surfaced words that are not affiliated with fishing: words such as bags, shoes, metal and power have high frequency counts. Therefore, they will be further grouped into broader categories.
7.2 Grouping Words into Categories
With the aid of the word cloud, we are able to group words into various categories. We begin by creating a custom stop-word list to remove generic words that we deem unsuitable; words like additional and range may not be part of stop_words. While creating the new data frame with a category column, we filter away numbers and special characters that appear in the word column.
grepl() is used to match words that follow a clear pattern (e.g. fish), while mutate() with case_when() matches words against curated lists. We identified 17 categories, ranging from Fishing, Other Food and Transportation to Apparel & Fashion. Unidentified words are filtered out of the analysis.
Show the code
```r
# custom stopwords list to remove generic words
toremove_stopwords <- c("character", "0", "unknown", "yellow", "pink", "red",
                        "â", "added", "adding", "additional", "additions",
                        "addition", "aid", "aids", "air", "related", "including",
                        "range", "products", "service", "services",
                        "source", "offers")

# new df with category column
stopwords_removed_addcategory <- stopwords_removed %>%
  filter(!word %in% toremove_stopwords) %>%   # filter out generic words
  filter(!grepl("\\d", word)) %>%             # remove numbers
  filter(!grepl("[^[:alnum:]\\s]", word)) %>% # remove words with symbols
  mutate(category = case_when(
    word %in% c("automobiles", "automobile", "car", "cars", "bus", "buses",
                "trucks", "truck", "trucking", "van", "vans",
                "aircraft", "airjet", "airfrieght", "airline", "airlines",
                "airport", "vehicle", "vehicles", "powertrain",
                "transportation", "trains", "automotive",
                "freight", "transport") ~ "Transportation",
    grepl("fish", word) ~ "Fishing",
    word %in% c("pangasius", "angunagas", "seafood", "seafoods",
                "shrimp", "shrimps", "tuna", "tunas", "pollock", "fin", "roe",
                "fillet", "haddock", "cod", "hake", "saithe", "scallops",
                "arrowtooth", "flounder", "chum", "salmon", "mahi", "tilapia",
                "crab", "oyster", "scallop", "clam", "cephalopods", "eel",
                "squid", "loligo", "illex", "cuttle", "abalone", "prawns",
                "lobster", "clams", "octopus", "oysters", "crabs", "halibut",
                "crustaceans", "sockeye", "trout", "herring", "sardines",
                "mackerel", "mussels", "mollusks", "lobsters", "molluscs",
                "caviar", "seabass", "shark", "yellowfin", "fillets") ~ "Fishing",
    word %in% c("oil", "oils", "gas", "cars", "gasoline", "energy",
                "evs", "electric", "biodiesel", "turbines", "steam",
                "power", "plants", "acidizing") ~ "Oil & Gas",
    word %in% c("chrysogaster", "insect", "insecticides", "pest",
                "pesticides") ~ "Insects",
    word %in% c("meat", "meats", "chicken", "beef", "food", "lamb", "foods",
                "pork", "vegetables", "fruit", "vegetable", "salad", "tea",
                "fruits", "poultry", "rice", "cakes", "sugarbeets",
                "bakery", "lambs", "lamb's", "meatballs", "bread", "cream",
                "crabmeat", "steak", "soup", "tomato", "cheese", "chocolate",
                "burger", "corn", "pasta") ~ "Other Food",
    grepl("beverage", word) ~ "Beverages",
    word %in% c("milk", "milks", "soymilk", "tea", "teas", "coffee", "juices",
                "water", "drinks", "drink", "juice", "wine", "wines") ~ "Beverages",
    word %in% c("production", "manufacturing", "manufacture") ~ "Manufacturer",
    word %in% c("dance", "zumba", "yoga") ~ "Dance",
    word %in% c("abrasive", "abrasives", "metal", "materials", "material",
                "steel", "plastic", "plastics", "paper", "rubber", "iron",
                "sealants") ~ "Materials",
    word %in% c("accessories", "accessory") ~ "Accessories",
    word %in% c("acid", "acids", "deacidification") ~ "Chemicals",
    grepl("chemical", word) ~ "Chemicals",
    grepl("animal", word) ~ "Animals",
    word %in% c("tech", "technology", "technologies", "nanotechnology",
                "biotech") ~ "Technology",
    word %in% c("explosive", "explosives", "pyrotechnics",
                "combustible", "powders", "gun") ~ "Explosives",
    word %in% c("shoes", "adidas", "footwear", "apparel", "bags", "shirts",
                "clothing", "fabric", "fabrics", "fashion", "textiles", "bag",
                "boots", "garments", "jewelry", "beauty", "wear", "belts",
                "slippers", "gloves", "caps", "dress", "short", "apparels",
                "skirts", "cosmetic", "garment", "jackets", "socks", "shoe",
                "pants", "cosmetics") ~ "Apparel & Fashion",
    word %in% c("toys") ~ "Toys",
    word %in% c("pharmaceutical", "construction", "leather", "dental",
                "stationery", "textile", "building", "optical", "researcher",
                "pharmaceuticals", "engineering", "freelance",
                "management") ~ "Other Industries",
    TRUE ~ "other"))

# create new df to filter away "other"
stopwords_removed_new <- stopwords_removed_addcategory %>%
  filter(category != "other")   # keep every category except other

# output for dataframe using knitr::kable
kable(head(stopwords_removed_new), "simple")
```

| id | country | type | revenue_omu | word | category |
|---|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | automobiles | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | cars | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | trucks | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | vans | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | buses | Transportation |
| Aqua Advancements Sashimi SE Express | Oceanus | Company | 115004667 | chemicals | Chemicals |
7.2.1 Word Extraction
While generating the categories above, we created the check_text data frame to cross-check high-frequency words. After running the code chunk above, the list below helps examine words that might have been missed.
To do so, we filter() only for the other category. Since we are only interested in high-frequency words, we compute frequency counts and examine them with a top-down approach.
Show the code
```r
# create a mini table of uncategorized words
check_text <- stopwords_removed_addcategory %>%
  filter(category == "other") %>% # filter unidentified categories
  group_by(word) %>%              # group by word
  summarize(count = n()) %>%      # frequency count
  arrange(desc(count))            # sort in descending order

# output for dataframe using knitr::kable
kable(head(check_text), "simple")
```

| word | count |
|---|---|
| frozen | 467 |
| equipment | 309 |
| fresh | 276 |
| systems | 180 |
| industrial | 164 |
| canned | 163 |
Notably, the words above remain uncategorized because they are either too generic or could belong to more than one category. For instance, frozen could apply to Fishing or to other types of food. It is therefore better to leave them uncategorized, as the distinctive keywords in each product/service description are still captured.
7.3 Distribution of Identified Word in each Category
A treemap displays hierarchical data as a set of nested rectangles, where each group is represented by an area that is proportional to its value. Based on our categorization, we would like to look into the distribution by our design category and the word involved.
Based on the frequency count, we are able to identify categories that are not related to fishing. We aggregate the stopwords_removed_new data frame using the dplyr package:
- group_by() is used to aggregate by category and word
- summarize() is used to compute the count
- arrange(desc()) to sort the count in descending order
Show the code
```r
# create an aggregated dataframe by category and word
clean_text <- stopwords_removed_new %>%
  group_by(category, word) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

# treemap saved under the 'it' object
it <- treemap(clean_text,
              index = c("category", "word"),
              vSize = "count",
              vColor = "count",
              type = "value",
              algorithm = "pivotSize",
              sortID = "count",
              palette = "Blues",
              border.lwds = "white")
```

We build our treemap by passing the aggregated data frame (clean_text) into the treemap() function and saving it as an object called it. Thereafter, the d3tree() function from the d3treeR package is used to build our interactive treemap.
```r
d3tree(it, rootname = "Categories of Business")
```

Observations:
- Multiple words associated with Fishing have been identified, with fish and seafood being the main contributors to the category.
- It is concerning that there are multiple categories unrelated to Fishing. Apart from food-related categories, Apparel & Fashion has one of the highest counts, with the main contributors being shoes, footwear, bags, and apparel.
The treemap() arguments used above:

- index: the category followed by the associated word
- vSize: the size of the rectangles, based on the count
- vColor: the color intensity, based on the count
- type: the value type of the treemap
- algorithm: the pivot-by-size layout algorithm
- sortID: determines the order from top left to bottom right
- palette: colors the treemap using an RColorBrewer palette
- border.lwds: sets the border line widths
7.4 Analyzing Apparel and Fashion
In this section, we analyse the Apparel & Fashion category further to see whether the businesses in this group are similar in nature. We begin by creating a new data frame, apparel_type, through aggregation. An interactive pie chart is used, as the number of categories is small and we suspect that the differences among the types will be significant.
Observations:
Notably, there are only companies in Apparel & Fashion.
The distribution by country is spread across the countries, with higher concentrations in ZH, Oceanus and Marebak. This is in line with the distribution in Section 5. If the business groups have to be further aggregated, it is ideal to look into the country after its category.
The reorder() command helps to reorder the columns: - represents descending order, while + represents ascending order. However, coord_flip() in the code flips the resulting order.
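The reorder() and coord_flip() behaviour can be illustrated with a small example; the country counts here are made up:

```r
library(ggplot2)

# Toy country counts (hypothetical values)
df <- data.frame(country = c("ZH", "Oceanus", "Marebak"),
                 n       = c(50, 30, 20))

# reorder(country, -n) sorts the bars in descending order of n;
# coord_flip() then turns the chart sideways, reversing the visual order
p <- ggplot(df, aes(x = reorder(country, -n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Count")
p
```

Without coord_flip(), ZH would appear leftmost; with it, the largest bar ends up at the bottom of the flipped axis.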
7.4.1 Interactive Data Table
As identified, there are 721 entries in the interactive data table, with 431 unique companies whose products/services are related to Apparel & Fashion. Further analysis might be needed to measure the similarity of these businesses.
Show the code
```r
# filter for the Apparel & Fashion category
stopwords_removed_new_af <- stopwords_removed_new %>%
  filter(category == "Apparel & Fashion") %>%
  select(id, country, word) %>%
  distinct() %>%
  ungroup()

# output for dataframe using DT::datatable
DT::datatable(stopwords_removed_new_af)
```

8. Conclusion
In our exercise, we observed the following:
Abnormality in business groups based on type: from the network graphs and points of contact, we identified multiple people with high ownership or large numbers of connections.
Multiple words not associated with fishing-related business: the identified keywords are distinct and can be categorized into broader groups.
Apparel & Fashion is one of the largest categories; it mainly comprises companies, and they operate widely across various countries.
References:
Chatterjee, S. (2019, January 10). Create a Word Cloud with R. Towards Data Science. Retrieved June 11, 2023, from https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a
R4VA. (2023, June 4). 27. Network Metrics Analysis. Retrieved June 11, 2023, from https://r4va.netlify.app/chap27.html#network-metrics-analysisr
VAST Challenge. (2023). MC3: VAST Challenge 2023 - Mini Challenge 3. Retrieved June 09, 2023, from https://vast-challenge.github.io/2023/MC3.html