Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
So I'm Wei. My full name
0:02
is actually Wei Dong Yang, but Wei
0:04
is easier to pronounce. And
0:06
I'm CEO of Kineviz, a
0:08
visual data
0:10
analytics company. And
0:12
I love coffee.
0:14
I think civilization starts with
0:16
the invention of coffee. So I have to
0:19
drink a coffee. I do
0:21
add milk to coffee because the black
0:23
coffee is a little bit too strong
0:25
for me. Welcome back
0:27
to another MLOps Community
0:29
podcast. Today we are lucky enough
0:31
to have not one but two
0:33
graph experts who have been doing this
0:35
for a very long time. I
0:37
got schooled. I felt like I
0:39
learned a ton about how to
0:41
use graphs as tools and ways
0:43
that we can leverage them better. Let's
0:45
get into this conversation with Paco
0:47
and Wei. As always, I'm your
0:49
host, Demetrios. And you know what
0:52
is a huge help? If you
0:54
can hit a little review on
0:56
whatever you are listening to this on, that
0:58
would mean the world to me. Boom,
1:00
let's jump into it. And oh
1:02
yeah, if you are one
1:04
of those people that is listening
1:06
on a podcast player, I
1:08
have got the recommendation for
1:11
you. Our
1:13
music recommendation, this
1:15
is thanks to
1:17
one of the people
1:20
in the community, Lee Wells, who
1:22
just joined and now whenever someone
1:24
joins the community, I ask them
1:26
what their favorite music is. Today,
1:29
we're listening to We Are One by Maze.
2:30
We're talking about PII and
2:32
using different methods to anonymize
2:35
data, right? And Paco, you
2:37
had said something that I
2:39
didn't fully understand, and then
2:41
Wei, you said something else that I didn't fully
2:43
understand, so maybe we can rehash that and I
2:45
can understand it the second time. Awesome.
2:50
Well, I was going to ask if you
2:52
all ever came across, there's another podcast
2:54
that I followed called The Dark Money Files,
2:57
and there's
3:00
a couple of consultants who have worked in
3:02
banks and understand a lot of the
3:04
ins and outs of financial crimes
3:07
and investigations. And
3:10
so I was just gonna preface it because
3:12
they've had a great series recently. If
3:14
you've ever heard of this thing called the
3:16
SAR, it's a suspicious activity report. And
3:18
the laws are really weird depending on
3:21
what country the bank is in. But
3:23
basically this, if you're at a
3:25
bank and you see some suspicious activity, like
3:27
there's a money transfer, and the
3:29
counter party is like a known
3:31
terrorist group or something, you see
3:33
something weird going on. Okay, number
3:35
one, you have an obligation to
3:37
report a crime to a criminal
3:39
investigation unit. If
3:41
you see something suspicious and you don't report it,
3:43
that's a crime. If
3:45
you see something suspicious,
3:48
you have not an obligation,
3:50
but a responsibility to send it up
3:52
the chain so that other financial houses
3:54
might share. If
3:57
you send too much information, you might
3:59
get sued. And then
4:01
so there are these reports and it
4:03
usually costs on average about $50,000 to
4:05
process each report. So you don't want to
4:07
generate too many of them. And like
4:09
machine learning models could generate thousands per day,
4:12
which would be like, you know, tens
4:14
of millions of dollars of liability. So this
4:16
whole space of like, what do I
4:18
do? I'm getting, I'm getting attacked. And what
4:20
do I do? Because I mean, also
4:22
these people are taking money and you might
4:24
have to, under some
4:26
situations as a bank, you might
4:28
have to compensate if there
4:30
is some kind of scam. So
4:33
you could be losing money and
4:35
facing like legal threats from three sides.
4:38
And meanwhile, there's this thing called
4:40
a SAR. And like, I've actually been
4:42
yelled at for asking what I was supposed to
4:44
integrate with something. And I was like, can I see
4:46
what the schema is? Like, no, you're not allowed
4:48
to, no, it's too confidential. So it's like, it's
4:51
just this whole tangle of worms about
4:53
how to handle it. What
4:56
do you actually do once
4:58
you have evidence of financial
5:00
crime or even suspicion of it?
5:03
What next steps you take are really
5:05
tangled. And I think, Wei Dong,
5:07
you probably have a lot more experience
5:09
about this in certain theaters too. I
5:13
have some similar experiences where
5:15
even the schema is
5:17
not allowed to be seen because
5:19
the schema may actually
5:21
reveal some... secrets or certain
5:23
activities may become liable to
5:25
certain parties. So that
5:28
can be pretty tricky. And
5:30
so it basically gives
5:32
away information that if you
5:34
were looking at it, you now, because
5:37
you know the schema, you can
5:39
guess a few other parts of
5:41
this puzzle and get information that
5:43
people don't want out there. The
5:45
banks are using a lot of
5:47
data that come from providers. There
5:50
may be other cases where
5:52
there's data that's coming from,
5:54
say, public sector agencies, crime
5:56
investigations. There may be
5:58
intelligence reports, and so there may be
6:00
parts of the schema that are highly sensitive
6:02
and only certain people are allowed to see.
6:05
But you were saying
6:07
that with graphs...
6:10
anonymizing that PII, you're
6:12
still able to gather insights,
6:14
right? Yeah, that was
6:16
cool. We were just in a talk
6:18
and Brad Corey from Nice Actimize
6:20
was showing where like they're preparing to
6:22
do RAG and they were using
6:24
I think Bedrock and they know that
6:26
they've got a hot potato. They
6:29
know they've got a lot of customer
6:31
PII that just can't go outside
6:33
the bank. So what they
6:35
were doing is substituting PII with
6:37
unique identifiers, tokens they
6:39
generate on the fly, and then
6:41
they make the round trip after
6:43
they've run three LLMs and made
6:45
a summary, and they replace
6:47
the tokens with the highly
6:50
confidential material they just
6:52
have internally. And so this
6:54
is a way of being able to use
6:56
some sort of external AI resources, but
6:59
still manage a lot
7:01
of data privacy. Yeah,
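The round-trip pattern described here can be sketched in a few lines. This is a minimal illustration, not NICE Actimize's actual implementation: the regex-based detector, the token format, and the `redact`/`restore` function names are all assumptions made for the example.

```python
import re
import uuid

# Sketch of the PII token round trip: swap PII for opaque tokens before
# calling an external LLM, then swap the real values back in afterwards.
# The SSN-only detector below is a deliberate simplification.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace each SSN with a freshly generated token; keep the mapping internal."""
    mapping: dict[str, str] = {}
    def _swap(match: re.Match) -> str:
        token = f"<PII_{uuid.uuid4().hex[:8]}>"
        mapping[token] = match.group(0)
        return token
    return SSN_RE.sub(_swap, text), mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Round trip: put the confidential values back into the model's output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe, mapping = redact("Customer SSN 123-45-6789 flagged a transfer.")
# `safe` contains no SSN, so it could be sent to an external model;
# the summary that comes back is re-hydrated locally.
assert "123-45-6789" not in safe
assert restore(safe, mapping) == "Customer SSN 123-45-6789 flagged a transfer."
```

The key property is that the token-to-value mapping never leaves the bank's infrastructure; only the redacted text makes the round trip.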
7:06
I've seen it with, we had
7:08
these folks on here from Tonic
7:10
AI, and they were talking about
7:12
how they would use basically the
7:14
same information but swapping it
7:16
out. So if it is
7:18
someone's name, they just changed
7:20
the name, so it went
7:22
from Paco to John. And if
7:24
it is a social security number,
7:27
they would swap out the social
7:29
security number and totally randomize the
7:31
number. But it still is
7:33
a social security number. So you,
7:35
at the end of the day, you
7:38
get almost like this double blind.
7:41
So even if you're a data scientist
7:43
looking at the information, you can understand
7:45
it. But you don't
7:47
know if it is the
7:49
true information that's going to reveal
7:51
that PII. Interesting.
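That swap-but-keep-the-shape idea can be sketched with the standard library alone. This is a toy illustration, not Tonic's product: the fake-name list, the hash-based mapping, and the SSN format are all assumptions for the example.

```python
import hashlib
import random

# Toy format-preserving substitution: a real name always maps to the same
# fake name, and an SSN becomes random digits that still look like an SSN.
FAKE_NAMES = ["John", "Maria", "Chen", "Fatima", "Lars"]

def fake_name(real_name: str) -> str:
    """Deterministic: the same real name always maps to the same fake name."""
    digest = hashlib.sha256(real_name.encode()).hexdigest()
    return FAKE_NAMES[int(digest, 16) % len(FAKE_NAMES)]

def fake_ssn(rng: random.Random) -> str:
    """Random digits, but the result still looks like a social security number."""
    return f"{rng.randint(100, 899):03d}-{rng.randint(10, 99):02d}-{rng.randint(1000, 9999):04d}"

rng = random.Random(42)  # seeded for reproducibility
print(fake_name("Paco"), fake_ssn(rng))
# A data scientist still sees a name and an SSN-shaped value, but neither
# reveals the underlying PII.
```

Consistency matters for analysis: because the same input always maps to the same fake value, joins and aggregations over the anonymized data still behave like the original.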
7:55
Interesting, yeah. Although
7:58
I do see situations where even
8:02
the structure of the document itself,
8:04
if it gets revealed, reveals
8:06
information that you do not
8:08
want people to know. Like
8:10
in
8:13
the investigation space, very often
8:15
you do not want the people
8:17
being investigated to know that they're being
8:19
investigated. But certainly information, even
8:21
the structure of
8:23
the document being reviewed, can
8:25
become a problem. So
8:28
at some point I
8:30
felt like the
8:32
in-house, on-prem
8:34
LLM might
8:36
be necessary. Especially, I
8:39
just read news
8:41
that the M3
8:43
Ultra Studio with
8:45
the 500GB RAM
8:47
can run large
8:49
language models at 20
8:51
tokens per second,
8:53
that could potentially
8:55
be an interesting
8:58
solution for that. Yeah,
9:00
I mean, for our end
9:03
use cases, you know, like 60%
9:05
of those are air-gapped, and
9:07
so, yeah, you know, the largest
9:09
chunk of that, they're gonna
9:11
be a lot of like
9:14
public sector agencies running in SCIFs.
9:16
So they can't do any
9:18
data out. Yeah. And
9:21
there's good news for running
9:23
really interesting LLMs on local
9:25
hardware. There's a lot of
9:28
really good news. I will
9:30
shout out to my friends
9:32
over at Useful Sensors, Pete
9:32
Warden and company, I'll put that
9:37
in the chat. You
9:39
can do a lot with local hardware.
9:42
What are they doing? Useful
9:45
Sensors, so Pete
9:47
Warden and
9:50
Manjunath Kudlur, they were part
9:53
of the TensorFlow team at
9:55
Google. And for
9:57
I think like eight years, they
9:59
evangelized use of deep learning inside
10:01
of products at Google, like internally. And
10:04
then they left, and the team has
10:06
a startup in Mountain View now. And what
10:08
they're showing is, hey,
10:10
here's like $50 worth of hardware. Here's
10:13
an ARM chip with a neural network
10:16
accelerator on it. And we can run
10:18
LLMs on battery power. So
10:20
it's pretty cool because they
10:22
came out of like the tiny
10:24
ML, I don't know if you've ever seen the
10:26
conference. Oh, yeah. And
10:29
so. You know,
10:31
this is a lot of the
10:33
specialty that Pete has. And,
10:35
you know, he
10:37
was on the CUDA
10:39
team at NVIDIA before. So,
10:42
I mean, these folks really
10:45
know how to make AI infrastructure
10:47
run on hardware, and particularly
10:49
how to handle a lot of
10:51
low power and low latency
10:53
kinds of situations, and
10:55
where to punch through the
10:57
bottlenecks. You don't necessarily have
10:59
to have a ginormous GPU
11:01
cluster, although in some cases
11:03
it helps. But especially when you're
11:05
running inference, you can be running on much
11:08
lower power and doing really interesting things out
11:10
in the field. So wild
11:12
now, I know that
11:14
we had originally wanted
11:16
to chat a bit
11:18
about this idea that
11:20
I think, Wei, you
11:22
had proposed, and it's a
11:24
little bit of
11:26
a differentiation on GraphRAG
11:28
and so maybe you
11:30
can set the scene for
11:33
us because yeah, I
11:35
want to go deeper
11:37
there. Yeah, I
11:39
run the danger
11:41
of going way too
11:44
far. Fundamentally,
11:48
I think with LLM,
11:51
the way machines process information has
11:53
changed. Before
11:55
LLM, everything is
11:57
exact, symbolic, like
12:00
matching all the APIs, all
12:02
the rigid data
12:04
structures. Just think about
12:08
Deep Blue, when it beat the
12:10
chess champion, everything is rigid
12:12
knowledge as rules and
12:14
things. LLMs
12:17
changed everything because LLMs started
12:19
to understand things on a
12:21
contextual basis, started to
12:23
understand fuzzy things. And
12:25
it suffers the same
12:27
weakness of a human
12:29
being, not exact, like
12:31
we glide over information,
12:33
we draw conclusions, we
12:35
make leaps, make
12:37
jumps. But at
12:39
the same time, LLM's
12:42
ability to reason like
12:44
humans, that for me
12:46
has fundamentally changed how
12:48
we approach computing. And
12:52
so in
12:54
applying LLM to
12:56
analyze documents, my
13:00
analysis is now we
13:02
can let LLM work more
13:04
like humans rather than
13:06
like the machines we understood in
13:08
the past. That also
13:11
implies what data structure
13:13
is preferred for LLMs, which
13:15
I would argue is
13:18
a data structure, a data
13:20
management, that preserves as
13:22
much contextual information as possible,
13:24
preserves as much nuance
13:26
as possible, because the
13:29
subtle nuances may turn out
13:31
to be important. So
13:34
I use the example
13:36
of, my wife is Brazilian.
13:38
The American tourist to Brazil
13:40
gets invited to a house
13:42
party, is told the party starts
13:44
at 6 p.m. So,
13:46
as a good American
13:49
guy, he shows up promptly on
13:51
time at 6 p.m. And
13:53
the hostess comes out still
13:55
wrapped in the shower
13:57
towel, totally confused.
13:59
Right? And it turns out
14:01
over there, 6 p.m.
14:03
is when the hostess
14:05
starts thinking about the party,
14:07
starts going out shopping,
14:09
preparing food and getting ready.
14:12
And the people usually don't show up until like
14:14
two or three hours later.
14:16
It's a big culture difference. Yeah,
14:19
if we try to capture
14:21
that in a knowledge graph, what
14:24
kind of construct allows us
14:26
to capture those subtle cultural
14:28
nuances there? And that might
14:30
become important in understanding the
14:32
document later. So I think
14:34
that's the challenge. Yeah. Paco,
14:37
you want to add something there? Let's
14:39
hear what you think. Well,
14:42
from a perspective of natural
14:44
language, something that the models
14:46
bring in, but it's kind of a
14:48
nuance and I don't think it's talked
14:51
about a lot. There's a very recursive
14:53
nature to how we as people talk
14:55
with each other and tell stories and
14:57
share information. We do reference
14:59
it in the sense of like going down
15:01
the rabbit hole. Like if you follow
15:03
a thread too far, you're kind of going
15:05
down the rabbit hole. And there's this very recursive
15:07
nature of how we think and especially how
15:10
we express. It certainly comes across
15:12
in written language, although we tend
15:14
to think of written language as
15:16
something linear. There's paragraphs and sentences,
15:18
and it can all be diagrammed.
15:20
But when you look at the
15:22
actual references that are inside of
15:24
those sentences, they're making recursive calls
15:26
throughout a story, throughout
15:28
somebody's speech or throughout
15:30
a book. And
15:33
we can try to linearize that and
15:35
come up with an index or a bibliography,
15:37
but at the end of the day, it's
15:39
a graph. And you get this
15:41
very self -referential thing in any text. And
15:43
this is something that the LLMs have
15:45
really, I think, pulled out. And
15:48
we were also just part
15:50
of the talk we were
15:52
in, also Tom Smoker from
15:54
WhyHow was showing
15:57
how they leverage ontology, they
15:59
leverage schema, and chase after
16:01
information recursively. So
16:04
that's just another kind of view
16:06
on this. But, Wei,
16:09
I love how you all
16:11
are approaching this. You have a
16:13
very powerful view of kind
16:15
of relaxing the constraints upfront, but
16:17
then having the context propagated
16:19
through. I realized there's an important
16:21
philosophical approach difference between East
16:23
and the West. And
16:26
the Eastern philosophy very much drive
16:28
towards the nature of things. And
16:30
it's important, which is that
16:32
that's very curiosity about nature of
16:34
things, that the desire to
16:36
have a definitive definition of nature
16:39
of something is led to
16:41
the great scientific discovery over the
16:43
past several hundred years. The
16:46
Eastern philosophy, very much on
16:48
the other side, is focused on the
16:50
contextual, focused on shifting, changing
16:52
nature of things. Like
16:54
the Chinese bible, the Daoist
16:56
bible, the Dao De Jing, in the
16:58
first verse it says Dao Ke
17:00
Dao, Fei Chang Dao, which means
17:02
if you name something, you get
17:04
it wrong. Or it's
17:06
not permanent. It's
17:08
really focused on impermanence of things.
17:10
It focuses on how everything changes
17:12
nature in context with other things.
17:15
So that is essentially a
17:17
graph. Now,
17:19
you're putting both things together.
17:22
So, okay, I have to
17:24
say that that attitude towards
17:26
like, oh, everything changes. Thus,
17:28
we cannot see anything. Thus,
17:30
everything is fuzzy, very
17:32
much contributed to why Chinese
17:34
technology and science developed very far
17:37
until about a thousand years
17:39
ago and then stalled. And
17:41
a lot of it is attributed to this
17:43
philosophical
17:45
attitude, which reduced a lot
17:47
of the curiosity and drive to go down deeper
17:49
into the nature of things.
17:52
However, in practical things, there's
17:54
some practical application of that
17:56
approach, which today with
17:58
LLM and graph, we really
18:01
see that it's like a
18:03
great combination of you allow
18:05
certain things to be drilled
18:07
down to be very definitively
18:09
defined, to be clearly defined
18:11
within the context. But
18:13
a lot of information,
18:15
contextual information, stays fuzzy.
18:19
So in fact, I feel
18:21
like I'm really excited about
18:23
integrating sensing and our graphs
18:25
as a kind of
18:27
solution together, because the sensing
18:30
helps to drive this definitive
18:32
part. Once you have
18:34
the definitive part, drilled
18:36
down, named, defined, it really
18:38
speeds up making a
18:40
lot of assessments fast, definitive,
18:43
and precise, which is crucially
18:45
important. But on the
18:47
other hand, you allow
18:49
this loose structure of information
18:51
decomposed as a graph
18:53
that you can easily retrieve.
18:56
and without losing the nuances, the
18:58
subtleties, like in
19:00
the cultural differences, you
19:03
still preserve that. So the two
19:05
things come together. My feeling is
19:07
one is how you
19:09
want to ground the LLM properly
19:11
to create a precise, accurate answer, and
19:13
know the limit, know when it
19:16
does not know, not to make
19:18
a judgment. I think that's also
19:20
very, very important. So in my
19:22
mind, the graph and AI
19:24
right now present an opportunity to
19:26
allow this Western way of driving
19:28
to the nature of things and the
19:31
Eastern way of focusing on the
19:33
contextual information to come together to work
19:35
together to solve practical problems. So,
19:37
very well said. And you
19:39
know, the challenge we face
19:41
is we don't really know what
19:43
the downstream application will be Like
19:46
we're doing investigation. We're doing some
19:48
kind of discovery whether you're trying
19:50
to find, you know, money launderers
19:52
or whether you're trying to find
19:54
you know, who's my best customer
19:56
for this hotel? It's a discovery
19:59
process and by nature of discovery
20:01
You don't know what the answers
20:03
are in fact in a complex
20:05
system. You don't even know where
20:07
or how just you know, it's
20:09
unknown unknowns, right? So by
20:11
preserving that context then you are
20:14
sort of fortifying yourself so
20:16
that when the time presents itself, you'll
20:18
be able to make the
20:20
right discoveries. You won't
20:22
have cut them off in advance. I
20:24
think if you go back to
20:26
before relational databases came out, you
20:28
go back to some of the
20:30
earlier writings from Ted Codd,
20:32
and one of his colleagues was
20:35
William Kent, who did... a book
20:37
called Data and Reality. If
20:39
you go back to some of
20:41
the early like 1970s thinking about data
20:43
management, it's really interesting to see
20:45
where the lines are drawn because in
20:47
this Western view, so much
20:49
of data management was about, let's
20:52
have a data warehouse, let's
20:54
pretty much throw away the relationships, let's
20:56
focus on the facts. We
20:58
have a lot of, as we were saying,
21:00
a very Western view of like, I
21:02
just want to know like millions of facts
21:04
and I will piece them together with
21:06
a query. I'm not, yeah, I'm not really
21:08
interested in preserving the context. So, I
21:10
mean, I think we have a long history
21:12
from like data warehousing of going too
21:14
far on the Western side. Well,
21:18
what is interesting to me
21:20
is the conversation that we
21:22
had with Robert Caulk on
21:24
here probably three months ago,
21:26
and how he said, we've
21:29
completely thrown out ontologies. And
21:31
for his specific use case, that
21:33
isn't the way that they wanted
21:35
to go. And I
21:37
wonder if you guys have thought
21:40
through that and what that looks
21:42
like, what the benefits are, and
21:45
is it one of these
21:47
things where you potentially are experimenting
21:49
on those levels too? In
21:51
my perspective, ontology is important, but
21:54
you have to know the boundaries. I
21:57
give a parallel to
21:59
theories in physics, like Newton's
22:01
law. Newton's law is
22:03
important. It captures important truth
22:05
in nature. However,
22:09
just like any physicist,
22:11
any physicist's theories, the
22:13
moment the theory is proposed, there's
22:16
a very important concept:
22:18
it is waiting to be disproved.
22:21
So you never accept it as
22:23
the truth of everything. You
22:25
have a theory. Paco
22:27
was a math scientist, so I think
22:29
he's also very familiar with the
22:31
concept. When you propose a theory, it may
22:33
test true, but you're always
22:35
looking for situations, looking for the
22:37
boundaries where the theory will stop
22:39
being true. So I
22:41
don't think ontology is
22:44
anything different. It's just like
22:46
ontology needs to be
22:48
very well-grounded. The
22:50
context needs to be defined.
22:52
And within this context,
22:54
this ontology knowledge
22:57
is real. It's truth.
23:00
The problem I see with a
23:02
lot of traditional knowledge graph
23:04
approaches is people ignore the fact
23:06
that ontology has to be
23:08
confined within a specific domain. The
23:11
moment you step out of the domain,
23:13
you have a problem. But
23:15
the other thing is, we think
23:17
this domain ontology is fantastic. It
23:20
helps you to solve problems
23:22
so much faster, so much more precisely.
23:25
But again, as long as you
23:28
can define the boundaries, define the
23:30
domains, it's great. What
23:35
Robert Caulk and Elin Törnquist
23:37
and others at AskNews,
23:39
what they're doing is they're
23:41
looking at news sources, especially
23:43
regional news sources across the
23:46
world, and they
23:48
really are finding hard
23:50
evidence, groundbreaking
23:52
evidence on the ground, literally,
23:54
if you're doing ESG work, and
23:57
you're trying to do diligence on a
23:59
company or a set of suppliers and
24:01
you want to find out like what
24:03
are their operations really like over in
24:05
that other country where they're based and
24:07
then you find out they're engaged in
24:09
like I don't know child labor or
24:11
something and you know you you want
24:13
to make other arrangements before your shareholders
24:15
find out um so I think with
24:17
Ask News you know they're out and
24:19
they're looking they're working with those publishers
24:22
and they're they're collecting that news and
24:24
representing it in a graph And
24:27
yeah, as
24:29
you were saying, I mean, an
24:31
ontology, ontologies really don't work across
24:33
domains. You really want to focus
24:36
more on like closed world within
24:38
a domain, having a
24:40
full enterprise wide ontology, nice
24:42
idea, but I rarely see it work. And
24:45
I think that in the case
24:47
of like understanding news reports in
24:49
the world, you don't know what
24:51
the domain is in advance. You
24:53
only know this is what is being
24:55
published. And so I think
24:57
by relaxing that constraint at Ask News,
25:00
they're able to come up with a
25:02
graph of like, here are things that
25:04
are related. You can follow this evidence
25:06
and you can find more historically about
25:08
this area. I
25:10
think those are very important, but
25:12
ultimately it will be shaped
25:14
by some kind of context, some
25:16
type of shared definitions. And
25:19
ontology is really more about sharing definitions
25:21
and making sure we're you know, describing the
25:23
same thing because I swear, you go
25:25
to a big company, use the word customer
25:27
in front of one VP, you
25:29
know, in sales, it means something different
25:31
to like the VP in charge of
25:33
procurement. So even like the
25:35
words themselves don't cross domains. The
25:38
graph is basically our idea that we
25:40
know that there's connections. Like if
25:43
you do have your operations data,
25:45
but then you also have your like
25:47
sales data, you know, there's some connections
25:49
across there. It's not exactly the same, but
25:51
some stuff is connecting, so graphs show
25:53
where those connections are. But I think, you
25:55
know, think about the example of
25:57
Google Maps. Like, there's different levels of detail,
25:59
and of course any video game
26:02
has this too. But, you know, if
26:04
you're taking satellite data and trying
26:06
to stitch together a map, you zoom in,
26:08
you can see the beach, and you
26:10
zoom in, you see the car tracks, and
26:12
you zoom in further, at some point
26:14
you're gonna get to pixels, right? Yeah. And
26:16
you zoom out, and maybe you see
26:18
this landscape of like a beach next to
26:21
the ocean, but then probably you zoom
26:23
out at some level and they've got like
26:25
the name of the beach. Right. So
26:27
there's like a high level of detail. I think
26:29
graphs are much the same. There are
26:31
connections at the low level. Like AskNews
26:33
is saying, like, you
26:35
know, here's reporting from Zimbabwe.
26:38
This is like the reporters on the ground.
26:41
But then you zoom out and you're like,
26:43
okay, well, you know, what impact does
26:45
this have on our supply network? Do
26:47
we have to really make different plans? Is there
26:49
going to be like a war breaking out that
26:51
causes, you know, all those shipping containers to be
26:53
delayed by three months? I
26:56
think at some level you need
26:58
to think of the graphs as
27:00
sort of collecting higher and higher
27:02
into more abstracted, more refined
27:04
concepts, if you will. And
27:06
so the stuff at the low level is kind
27:08
of like, let's see how it all fits together. The
27:11
stuff at a higher level, it's like, oh, actually,
27:13
we can maybe do some inference on this, or we
27:15
can use this to help structure other data that
27:17
we're going to piece together. So,
27:21
Demetrios, you actually touched
27:23
on a really big subject
27:25
that is... Now, in
27:28
the exploratory
27:30
process, it's combined
27:33
with the questions. Knowing what
27:35
question to ask often is
27:37
80-90% of the work. So,
27:40
a prescribed thing to
27:42
give you the answer often
27:44
misses the point, or
27:46
misses the important subtleties. But
27:49
the problem is how
27:51
do you discover the question
27:53
you need to ask?
27:55
And so in the way
27:57
that our perception, our
27:59
visual perception, our brain is
28:01
a fantastic... I don't want to
28:03
call it a machine, or
28:05
I don't want to even call
28:07
it a tool, but has
28:09
this great power of seeing patterns
28:11
in the information. Like
28:13
we look out in the sky,
28:15
we see the cloud, we
28:18
have some... we have some kind
28:20
of, like you are a
28:22
performer, I look at your performance,
28:24
your dance, like the information
28:26
being expressed without being able to
28:28
verbalize it, to define it,
28:30
but you have to watch it
28:32
to feel that. Maybe
28:34
you watch it long enough, you start to be able to
28:36
describe it, you start to be able to say, oh,
28:38
this is, something
28:40
is there. So in a way
28:42
that what the graph does is
28:45
the graph is a fantastic medium
28:47
for visualization. You look
28:49
at the information expressed, just
28:51
like how our brain works. Like
28:53
when we think about you, Demetrios,
28:55
I immediately think about Paco, because
28:57
we were in the same podcast
29:00
room together, so that's association. So
29:02
this association of
29:04
multiple pieces of information entities
29:06
in the space, if you
29:08
visualize effectively, it helps you
29:10
to see the patterns, helps
29:13
you to see all the
29:15
missing links, missing patterns, things
29:17
that get our attention. And
29:19
then we start to be able
29:21
to formulate the question, to
29:24
formulate, to
29:26
answer the question. More
29:30
than a tabular data
29:32
structure, I have to say,
29:34
the graph really helps
29:36
us to engage our brain
29:39
in this way, to
29:41
spot important information. Just go
29:43
watch a dance performance. You
29:46
see something definitive
29:48
happening, but you
29:50
know it before
29:52
you engage your language
29:54
or logical thinking. Afterwards,
29:58
things, concepts start to form,
30:00
and then you can start to
30:02
build things around it. Oh, dude.
30:06
How cool is that? You know
30:08
it before you can express it
30:10
in that way. Absolutely. I
30:12
think a lot of analytics workflows
30:14
work the other way around. We
30:16
focus so much on building up
30:18
the queries, building up
30:20
the programs to
30:23
drive it, to
30:25
drive the answer.
30:29
But as Paco and I,
30:31
in the investigative space,
30:33
we all know that too
30:35
often getting the hint
30:37
is 80% of the work. Like
30:41
if you know that you're being
30:43
attacked, you know that they came in
30:45
through some vector, there's probably some
30:47
set of machines that are compromised. You're
30:50
not seeing that. You're seeing where you
30:52
know, the bad things are happening, stuff
30:54
is being stolen or whatever. So
30:57
looking across your network, just building up a
30:59
graph of like the associations of what's happening
31:01
during an attack, there's some placeholders. There are
31:03
definite questions that could be generated like, which
31:05
machine was compromised? Maybe I should fix that.
31:07
So I think from the operational perspective, you
31:09
know, I mean, you kind of have to
31:11
think of, I mean, we do think about
31:13
that, right? We do think about like, how
31:16
do we identify those unknowns? But the
31:18
problem is that the more complex
31:20
the problem becomes the more that
31:22
those unknowns are not something that
31:24
can really be charted. They have
31:26
to be sort of poked at
31:28
and explored. Yeah, and I think
31:30
that's why, Wei, what you're saying
31:33
with the graph being this visual
31:35
medium that we can poke at
31:37
and we can explore, and it
31:39
gives us a different perspective with
31:41
which we can work with and
31:43
wrestle with the data, is
31:45
something that I hadn't heard before,
31:47
but it makes complete sense. From
31:50
a historical perspective, in terms of
31:52
data, you know, something to bring
31:54
out would be to consider spreadsheets,
31:56
because like spreadsheets are sort of my
31:58
go -to example. This is all in
32:00
tabular form. It's very, very sort of,
32:03
you know, left brain. Everything is
32:05
very buttoned down. But the thing about spreadsheets
32:07
that you never see is there is a really
32:09
complex graph behind it, and it only works
32:11
because of that. But they never
32:13
show that. They just show the tabular
32:15
part. But all the real knowledge and
32:17
dynamics and all the real information you're
32:19
capturing in a spreadsheet is about those different
32:21
dependencies and how that graph functions. Classic.
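Paco's point about the hidden graph inside a spreadsheet can be made concrete with a toy recalculation engine: formula cells depend on other cells, and recomputation is a topological walk of that dependency graph. The cell names and formulas below are made up for illustration; `graphlib` is Python's standard topological-sort module.

```python
from graphlib import TopologicalSorter

# Toy spreadsheet: two literal cells plus two formula cells.
values = {"A1": 10, "A2": 32}
formulas = {
    "B1": (["A1", "A2"], lambda a1, a2: a1 + a2),  # B1 = A1 + A2
    "C1": (["B1"], lambda b1: b1 * 2),             # C1 = B1 * 2
}

# Build the dependency graph: each formula cell points at its inputs.
deps = {cell: set(inputs) for cell, (inputs, _) in formulas.items()}

# Evaluate in dependency order, which is exactly what a spreadsheet
# engine does behind the tabular view it shows you.
for cell in TopologicalSorter(deps).static_order():
    if cell in formulas:
        inputs, fn = formulas[cell]
        values[cell] = fn(*(values[c] for c in inputs))

print(values["C1"])  # → 84
```

Change `A1` and every downstream cell recomputes in the same order; the grid you see is just a projection of this graph.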
32:25
Of course we don't see it, because that
32:27
would be absolute chaos for us. Mind
32:29
blown. The graph is
32:31
this fantastic medium for
32:33
this perceptive thinking. Well,
32:36
the challenge is like, when
32:38
we talk about graph, I think
32:40
that we need to really,
32:42
really, like, separate two things:
32:45
graph as the medium of information
32:47
capture and graph as the
32:49
medium to help us
32:51
think. Those are two different
32:53
things. Graph as information capture, the
32:56
sole purpose is to capture
32:58
information as precisely as possible,
33:00
as completely as possible. You
33:02
want to capture as much
33:04
truth as possible. However,
33:07
graph as a way of thinking,
33:10
if you take the raw
33:12
graph as captured, preserving a
33:14
lot of truth, well, the
33:16
problem is we can only
33:18
hold seven pieces of information in our brain
33:20
at any given moment.
33:23
We'll be overwhelmed by all those
33:25
graphs. If we think about
33:27
our brain, in that
33:29
way, even the vector
33:31
embedding, I call it an implicit
33:33
graph, because vector embedding gave
33:35
you a medium to compute the
33:37
similarity. Effectively, you
33:40
can construct a graph. Exactly.
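What Wei calls an implicit graph can be made explicit with a few lines: connect any two items whose embedding vectors are similar enough. A minimal sketch in Python; the toy embeddings and the 0.8 threshold are invented for illustration, not anything from the episode:

```python
from itertools import combinations

def cosine(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def similarity_graph(embeddings, threshold=0.8):
    """Manifest an explicit graph out of the implicit one: add an edge
    between any two items whose embeddings are similar enough."""
    return [(a, b)
            for (a, ua), (b, ub) in combinations(embeddings.items(), 2)
            if cosine(ua, ub) >= threshold]

emb = {"paco": [1.0, 0.1], "wei": [0.9, 0.2], "coffee": [0.0, 1.0]}
print(similarity_graph(emb))  # [('paco', 'wei')]
```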
33:43
You can manifest a graph
33:45
out of it. So
33:48
you will see that the
33:50
graph being captured at the layer,
33:52
at the stage that's really
33:54
designed to preserve the ground truth,
33:57
as much truth as possible. But
34:00
then you need a way
34:02
to work the data into
34:04
a form that we can
34:06
easily digest with our perceptive
34:08
power. That is a challenge.
34:10
This is also why, in
34:12
my mind, there is a paradox with
34:14
graphs. In theory, people
34:16
know the graph is how we
34:18
think. Thus, it's important.
34:21
But in practice, that is
34:23
a barrier. And how do
34:25
you reconcile the need between
34:27
graphs as information capture medium
34:30
and the graph to support
34:32
our perceptive thinking medium? It's
34:34
a very different thing. just
34:39
going back to what you
34:41
were saying with, we can relate to
34:43
each other because we're on
34:46
this podcast together. We've done stuff
34:48
together. Maybe there's certain things
34:50
that come up in our memories
34:52
that are going to be
34:54
the most pertinent to that graph
34:56
that we have in our
34:58
head, but it's never going to
35:01
expand more than seven hops
35:03
or seven different parts of that
35:05
graph. Have you
35:07
ever worked with, there's
35:09
like a kind of, I guess
35:11
rubric might be a way to
35:14
say it, that came out of Carnegie
35:16
Mellon, out of CMU. Jeannette Wing
35:18
had this idea of what's called
35:20
computational thinking. And so it's
35:22
sort of like a four-step process
35:24
of breaking down a problem
35:26
and then abstracting back out. It's
35:29
really powerful and I've used a
35:31
lot in courses teaching people but I
35:33
think that there may be
35:35
something kind of emerging as
35:37
like graph thinking and so just
35:39
to throw out like a straw
35:41
man here This is kind of
35:43
thinking out loud, but one of
35:46
the things that we see in
35:48
like fin crime in financial investigations
35:50
is a kind of graph thinking
35:52
a four step process repeated over
35:54
and over where you know,
35:56
you do your best to build
35:58
out this graph and it might have hundreds
36:00
of millions of nodes or billions of
36:02
nodes or some ginormous number, something beyond human
36:04
scale, beyond human comprehension. But
36:07
then step two, partition. So
36:10
like, can we break out this
36:12
enormous graph into some areas of
36:14
subgraphs of patterns that are interesting?
36:16
Like, hey, this looks like
36:19
a really good customer, or hey,
36:21
this looks like a money mule,
36:24
you know, fraud scheme. And
36:27
so you go, you do this dimensional
36:29
reduction then because you go from like five
36:31
billion nodes in a graph down to
36:33
maybe 10 or 20 that are interesting. And
36:36
so that's like, there are graph algorithms
36:38
like Louvain or, like, you know, weakly
36:40
connected components, or there are different ways
36:42
to get down to that scale. And
36:45
in like in machine learning in general,
36:47
we're looking at a lot of dimensionality reduction,
36:49
right? So, Once
36:51
you've got down to that scale now
36:53
you can use other graph algorithms like
36:55
maybe between a centrality or different forms
36:57
of centrality to understand how are these
36:59
parts connected and Gosh, maybe there's like
37:01
one node in there who's orchestrating the
37:03
whole crime ring. In a typical case there
37:06
might be like a person with a
37:08
bunch of shell companies, right? And they're
37:10
doing fraud So that's step three is
37:12
like leveraging certain types of graph algorithms
37:14
to, sort of, think of PageRank:
37:16
let's bubble up to the top the
37:18
parts that are probably the first
37:20
good steps to investigate. And
37:23
then step four, put
37:25
it through a work process.
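The four steps Paco describes — build the graph, partition it, rank what's left, then route to a workflow — could be sketched roughly like this. This is a toy stand-in, not his implementation: it uses plain connected components instead of Louvain, degree instead of betweenness or PageRank, and the node names are invented:

```python
from collections import defaultdict

def components(edges):
    """Step two: partition the big graph into connected subgraphs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        parts.append(comp)
    return parts

def most_central(edges, comp):
    """Step three: inside one subgraph, surface the best-connected node
    (degree here is a cheap stand-in for betweenness or PageRank)."""
    deg = defaultdict(int)
    for a, b in edges:
        if a in comp and b in comp:
            deg[a] += 1
            deg[b] += 1
    return max(deg, key=deg.get)

# Step one happened elsewhere: a (tiny) graph was built.
edges = [("mule1", "shell_co"), ("mule2", "shell_co"),
         ("mule3", "shell_co"), ("alice", "bob")]
ring = max(components(edges), key=len)   # the interesting subgraph
suspect = most_central(edges, ring)      # step four: hand this to an analyst
print(suspect)  # shell_co
```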
37:27
And I mean, if you're working with people
37:29
in a bank, put it through case management
37:31
tools, you know, a level
37:33
A analyst gets assigned it, they go
37:36
and they start poking around the graph, they
37:38
do something interactive, they work with the
37:40
visualization, and they apply what they've learned. Or
37:43
you may have some agents involved
37:45
there too to help like summarize
37:47
and and and dig up part,
37:49
but it's a workflow So it's
37:51
kind of a four -step process
37:53
of sort of graph thinking if
37:55
you will that can be applied
37:57
and can integrate people and also
37:59
AI technology together. Yeah, I want
38:02
to add one more thing to what
38:04
Paco said. It's really, really important
38:06
to be able to narrow it
38:08
down, to be able to identify
38:10
things, to reduce, reduce, reduce. But
38:12
there's also another aspect
38:14
which is a simplification
38:17
abstraction. Like very
38:19
often when you capture the data, you
38:21
don't really know the domain, or
38:23
you don't know the future
38:25
question. So the domain is
38:27
wide. But when we look for
38:29
the information, the domain
38:31
is narrowed. When the domain is narrowed,
38:33
for example, like I call Paco
38:35
as a math scientist, at some
38:37
point I can just refer to Paco
38:40
as a math scientist. I don't
38:42
need to add information because the math
38:44
scientist is Paco. And
38:46
that's only valid in the
38:48
specific domain. So
38:50
the reason I say that is
38:52
because with a lot of information,
38:54
when the domain is wide, like
38:57
I call it, when
38:59
you capture information, I
39:01
prefer a pure-edge
39:03
approach. Like in the
39:05
graph, an edge has no
39:07
properties. It's just an edge, it's
39:09
just an association. Anything that needs a
39:11
property, meaning the things you
39:13
may need to amend
39:15
later on, maybe you have
39:17
something pointing to it or pointing
39:19
out of it, you keep
39:21
it as a node. Now, as
39:23
you're thinking very often like I
39:26
know Paco. "I know Paco,"
39:28
this relationship, can carry a
39:30
lot of context in it already.
39:32
I don't need additional information
39:34
to show, to tell how
39:36
I know Paco. It can just
39:38
be in there. "I know Paco"
39:41
itself is sufficient. So
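Wei's pure-edge approach might look like this toy sketch, where the context of how two people know each other is reified as a node rather than stored as edge properties; all the names and labels here are hypothetical:

```python
# Edge-with-properties style packs context into the relationship itself:
rel_with_props = ("wei", "knows", "paco", {"via": "podcast_2024"})

# Pure-edge style: edges are bare associations. Anything you might later
# amend, or point other facts at, is kept as a node instead.
edges = {
    ("wei", "attended", "podcast_2024"),
    ("paco", "attended", "podcast_2024"),
    # new facts attach to the context node; no edge needs rewriting:
    ("podcast_2024", "hosted_by", "demetrios"),
}

def share_context(a, b, edges):
    """'a knows b' distilled from the detailed graph: true if the two
    people point at any common context node."""
    ctx_a = {t for (s, _, t) in edges if s == a}
    ctx_b = {t for (s, _, t) in edges if s == b}
    return bool(ctx_a & ctx_b)

print(share_context("wei", "paco", edges))  # True
```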
39:43
what that means is when
39:45
we present like I know Pako
39:47
that relationship as a single
39:49
relationship, right? In
39:51
the data layer, there might be
39:53
thousands or tens
39:56
of thousands of pieces of information there,
39:58
but it comes out as
40:00
one single piece of concise
40:02
information. I think
40:04
that is where I
40:06
think an analytic workflow
40:08
or visual analytic workflow
40:10
should be, is to
40:12
be able to go
40:14
from a very detailed,
40:16
broad, big, large information,
40:18
distill or aggregate down
40:20
to a simple representation,
40:22
but is grounded in
40:24
that particular domain, in
40:26
that particular context, so
40:29
for us to, so we can
40:31
communicate. We can communicate in
40:33
simple language rather than carry a
40:35
lot of information unless we
40:37
have to. I know
40:39
Paco, that's it. We
40:41
don't need to know how we know each
40:43
other, where do we know each other
40:45
in certain contexts. Is
40:47
it almost like the data
40:49
underneath is like an
40:51
iceberg in a way and
40:53
you knowing Paco is
40:55
like the tip of the
40:57
iceberg. You have that one. Piece
41:00
of information, but then if you
41:02
wanted to get more granular you
41:04
can go down and see the
41:06
whole iceberg. Yes. Wei, could we,
41:09
could we say then that, you
41:11
know, we pull everything. We connect
41:13
everything together. It's very noisy. We
41:15
can go up different levels of
41:17
abstraction. But to your point then,
41:19
we're going up levels of abstraction
41:21
in particular domains, like for purpose.
41:24
So we have some shared definitions.
41:27
And then we can start to
41:29
say, OK, now let's do our
41:31
Louvain partitioning or whatever. Then we
41:33
start to drill down into subgraphs.
41:35
It's like maybe a five -step process.
41:37
Yeah, even with Louvain
41:40
community calculation or any
41:42
centrality calculation, the graph
41:44
has to be simple.
41:46
Because very often, I
41:49
think the graph we
41:51
talk about is I
41:53
call it the multi
41:55
-domain graph. It has
41:57
different type of information
42:00
in one graph. So
42:02
computing a centrality
42:05
in that kind of a
42:07
hypergraph as a hypergraph
42:09
is very challenging, or what
42:11
does it mean as
42:14
a result if you mix
42:16
humans and emails?
42:18
It's difficult. So that
42:20
process itself to me
42:22
is we already need to
42:24
prepare, to transform, our
42:26
graph data into a form
42:28
that is suitable for
42:31
that centrality computation. Very often
42:33
like you have to
42:35
already project into a specific
42:37
domain for that computation
42:39
to happen. Very
42:41
good. That's what
42:43
I was thinking is like the
42:46
data that you have only becomes
42:48
relevant once you've narrowed it down
42:50
in a certain way and you're
42:52
looking at a certain plane of
42:54
that domain and you say, okay,
42:56
now we're going to be focusing
42:59
in on this plane. That's
43:01
when certain nodes
43:04
and certain data and certain
43:06
connections become relevant because
43:08
you're looking at that layer
43:10
almost in my head
43:12
if I visualize it. And
43:14
we're talking about that
43:16
Google Maps example again, you're
43:18
diving deeper and deeper
43:20
and you see different structures
43:23
depending on the layer
43:25
that you're looking at. And
43:29
and this fits very well with
43:31
like data mesh kinds of concepts,
43:33
you know, Zhamak Dehghani talking about
43:36
how different domains share. You have to
43:38
abstract, you have to come up
43:40
with the relations. I think she also
43:42
has the idea of, like, contracts,
43:44
you know where you have relations across
43:46
domains So you share some definitions
43:48
you have to you have to condense
43:50
down to that level before you
43:53
can go across domain so Yeah,
43:55
if we use the domains in
43:57
an organization to kind of guide when
43:59
and where and how do we
44:01
condense down, then we can
44:03
really take advantage of this
44:05
kind of abstraction. But it's
44:07
almost like I realized after
44:09
I said it, there's
44:11
two vectors or there's two
44:14
dimensions that you are
44:16
looking at when you are
44:18
zooming in or zooming
44:20
out because you're playing on
44:22
the field of
44:24
granularity, but you're also playing
44:26
on the field of the domain
44:28
and what is relevant in
44:30
that domain. So if we have
44:33
that X and Y axis,
44:35
you can get more granular inside
44:37
of the domain, but then
44:39
you can also just go on
44:41
the X axis and change
44:43
domains. And so that, like a
44:45
kaleidoscope, when you turn it,
44:47
you see a whole different set
44:50
of relations. Yeah,
44:54
and I mean in an enterprise context
44:56
this gets really bizarre because, you know,
44:58
the people in the domains that
45:00
you depend on may not even know
45:02
that you're out there You know, you
45:04
may be consuming from some log files
45:06
from another application that are like totally
45:08
driving your product So like can we
45:10
have some sort of contract so that
45:12
we know about each other? But
45:15
yeah scooting across the domains.
45:17
That's the that's the key
45:19
challenge to like leveraging these
45:21
kinds of technologies because usually
45:24
You are in a particular domain when
45:26
you're making those decisions, but for most
45:28
applications you have to combine a couple
45:30
domains, right? So it's usually
45:33
like there's something interesting going
45:35
on between like sales and
45:37
procurement or or sales and
45:39
marketing or or you know
45:41
some other business unit So
45:43
usually oftentimes you will have
45:45
to combine and do you
45:47
then try and create two
45:51
different graphs that are connected to each
45:53
other, or is it one larger graph?
45:55
How do you look at it in
45:57
that regard? Federation
45:59
sounds good. I think trying to
46:01
have one ginormous graph is usually...
46:03
weird. And those projects
46:06
usually don't ever end. But
46:08
federating and being able to go across
46:10
domains and say, okay, over there, let me,
46:12
let me send you something. I'd
46:15
like to know what you can,
46:17
what results can you bring back? So
46:19
are you making a prompt in
46:21
Graphrag across a different domain? Are you
46:23
making a query running some algorithm,
46:25
whatever? There's some kind of information transfer,
46:27
but federation. Yeah,
46:30
I can talk
46:32
about a couple of my
46:34
personal experiences. First,
46:38
bringing information to graph is
46:40
a step forward, a
46:42
step up. Because
46:44
information as a tabular format,
46:46
it needs to be confined
46:48
to very specific definitions,
46:48
a pretty narrow domain. Graph
46:53
is, there's one example, I
46:55
look at the US flight
46:57
record. You can download it
46:59
from the Department of Transportation. They
47:01
release it every two weeks. The
47:04
damn thing has 140
47:06
columns, I think. Really,
47:09
really wide. And
47:11
the reason is because the
47:13
flight may get diverted. Whenever
47:16
the flight gets diverted, you
47:18
add about 10, 15 columns
47:20
of information. So then
47:22
you need to capture that the flight
47:25
may be diverted more than once. Twice,
47:28
is that enough? No, three
47:30
times. Three is not enough, no,
47:32
some have four. So they
47:34
actually allow five diversions. But
47:36
if you have six, too
47:39
bad, it cannot exist. So
47:41
that's the limits of
47:43
tabular format in the
47:45
information capture. With
47:47
the graph, it relaxes a lot. Naturally,
47:50
you can have a thousand diversions.
47:52
I don't care. You
47:54
can just, like, the graph
47:56
can keep adding to
47:58
it. So that is really,
48:00
really a big improvement with
48:02
the graph to allow you
48:04
to have a lot more
48:06
flexibility in capturing the information. And
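A toy illustration of the contrast Wei draws between the two capture formats; the real DOT table's column names differ, these are simplified stand-ins:

```python
# Tabular capture caps repetition: the on-time table reserves a block of
# columns per diversion, up to five, so a sixth has nowhere to go.
row = {
    "flight": "UA123",
    "div1_airport": "DEN",
    "div2_airport": "SLC",
    "div3_airport": None,   # unused slots still occupy columns
    "div4_airport": None,
    "div5_airport": None,
}

# Graph capture relaxes that: one edge per diversion, any number of them.
diversions = [
    ("UA123", "DIVERTED_TO", "DEN"),
    ("UA123", "DIVERTED_TO", "SLC"),
]
diversions.append(("UA123", "DIVERTED_TO", "ORD"))  # a third, or a thousandth
print(len(diversions))  # 3
```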
48:09
the other thing is like
48:11
very often in the tabular
48:13
format, it's very difficult to
48:16
check the mismatch. We
48:18
have example of bringing
48:20
two dataset manager from two
48:22
or three different departments
48:24
in the same organizations. Everybody
48:27
knows the other person's data
48:29
has a problem, but
48:31
you can't force other people to
48:33
fix it. But with the
48:35
graph, when you bring things together,
48:37
you immediately see the mismatches. And
48:39
that, so we have one example
48:41
of a company that spent a couple
48:43
years, they could not reconcile the
48:45
data. But once they bring the
48:47
data into the graph, they start
48:49
to see the mismatch in one
48:51
month, they fix the data problem.
48:54
But they start to see the
48:57
mismatch because of the dependencies? Because
49:00
now, let's see,
49:02
you know the records are
49:04
unique, right? But
49:06
then when you link the
49:08
other record together, you need to
49:10
see, oh, this record is actually
49:12
duplicating other systems that they recorded
49:15
differently. Somebody made a mistake there.
49:17
Yeah. We see that a lot for
49:19
entity resolution work. You think like
49:21
a social security number is unique. But
49:24
then you're bringing in data from some other
49:26
sources. And there was an
49:28
application where maybe early on the product manager
49:30
said, yeah, we need to collect this
49:32
social security number. And then later on they
49:34
said, oh no, we can't do that.
49:36
Just put it in a dummy number. And
49:39
so now you've got like this data
49:41
set that has, you know, 5 ,000 instances
49:43
of the same social security number. So once
49:45
you start to put in a graph,
49:47
you're like, wait, isn't that supposed to be
49:49
unique? How come there's like this enormous
49:51
node with like all these things connected to
49:53
it? Something's wrong. So
49:56
it's really also a great way
49:59
to figure out data quality issues.
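A rough sketch of the data-quality check Paco describes: in graph terms, a dummy SSN shared across thousands of accounts becomes one enormous node, so flagging high-degree SSN nodes surfaces the problem. The records below are invented:

```python
from collections import Counter

# Hypothetical account records: a dummy SSN reused across many accounts.
records = [("acct1", "123-45-6789"), ("acct2", "123-45-6789"),
           ("acct3", "123-45-6789"), ("acct4", "999-99-9999")]

def suspicious_ssns(records, max_degree=1):
    """An SSN node linked to more accounts than expected flags either a
    data-quality problem or a deliberate dummy value."""
    degree = Counter(ssn for _, ssn in records)
    return [ssn for ssn, d in degree.items() if d > max_degree]

print(suspicious_ssns(records))  # ['123-45-6789']
```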
50:01
Yeah. Although there's security.
50:04
I mean, going back to what we were
50:06
talking about before. if you are looking
50:08
in financial investigations, if you're looking in sort
50:10
of criminal investigation, okay, maybe
50:12
you've got some open data, like
50:14
here's, you know, sanctioned shell companies
50:16
or whatever. And then
50:19
maybe you've got some private information
50:21
like customers, but maybe you've also got
50:23
some feeds of like, oh yeah,
50:25
here's an active investigation. We're looking at
50:27
these people. But then
50:29
these particular people, they
50:31
have, you know, immunity
50:34
because they're diplomats. So
50:36
like there's all these different levels of
50:38
security and you start to pull it all
50:40
together in a graph. You get a
50:42
very comprehensive view. Maybe not everybody
50:45
can even see that. Like you don't,
50:47
you know, you don't want the police
50:49
officers who are doing parking tickets to
50:51
know that, you know, XYZ diplomat might
50:53
be investigated for a crime. Like that
50:55
information should not go out. So
50:59
where do you draw the line? Because
51:01
the graph really brings it all together.
51:03
But then how do you handle security
51:05
issues? The
51:07
access control with the
51:10
graph is automatically harder than
51:12
the tabular, the traditional
51:14
database. Well, it feels
51:16
like one of these, what
51:18
you were talking about, with
51:21
the ways that you visualize
51:23
it, you can
51:25
almost create different
51:27
access controls on
51:29
the visualizations. So
51:31
I don't know if you've thought through that
51:33
in a way, but is that kind of
51:35
how you go about it? So
51:37
fundamentally, access control needs to be
51:39
in the data management layer. Like
51:43
if the database can
51:45
support access control, you're
51:47
great. We
51:49
actually, however, run into a situation
51:51
that database do not have the
51:53
sufficient access
51:55
control that supports business needs. So
51:57
in that situation, we actually
51:59
have to implement a filter layer
52:01
in the data access. When
52:04
we pull the data from the
52:06
database, depending on
52:08
the roles and teams,
52:10
and we actually
52:12
prohibit certain information from
52:14
being accessed. But
52:16
that's not a fundamental solution.
52:18
Fundamental solution has to be
52:21
in the data management layer.
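A minimal sketch of the stop-gap filter layer Wei describes, assuming made-up role names and sensitivity labels; as he notes, the fundamental solution still belongs in the data-management layer:

```python
# Hypothetical edges tagged with a sensitivity label at capture time.
EDGES = [
    ("officer_A", "investigates", "diplomat_X", "restricted"),
    ("acct1", "pays", "acct2", "public"),
]

# Which labels each role is cleared to see (invented roles).
ROLE_CLEARANCE = {
    "parking_analyst": {"public"},
    "investigator": {"public", "restricted"},
}

def visible_edges(role):
    """Filter the graph per role before it reaches the visualization."""
    allowed = ROLE_CLEARANCE[role]
    return [(s, p, t) for (s, p, t, label) in EDGES if label in allowed]

print(len(visible_edges("parking_analyst")))  # 1
print(len(visible_edges("investigator")))     # 2
```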
52:23
It's a hard problem. In
52:25
previous work, which is
52:27
more like knowledge graphs
52:30
being used for large
52:32
-scale manufacturing, one
52:34
of the things we ran into
52:36
is security access because you take
52:38
procurement data, plus some operations data,
52:40
plus some sales data, put it
52:42
all into a graph. Suddenly,
52:44
you have a picture of how
52:46
the company works. But it's like a
52:48
really confidential picture. It's like maybe
52:50
the board could see this, but nobody
52:52
else in the company should see
52:54
it. So there's a real power
52:56
there, but there's always a risk. And
52:59
how do you manage that is
53:01
a mind -bogglingly difficult problem. I
53:05
read a book
53:07
talking about... the
53:09
certain, like, intelligence communities,
53:11
when they go to other
53:13
countries. In the past,
53:16
you would use, like, falsified
53:18
identities, but today that is
53:20
not a good idea anymore because of
53:22
all the open-source
53:25
intelligence out there. Even if you
53:27
want to withhold
53:29
some information, people can
53:31
stitch together a picture because
53:33
related pieces of
53:36
information sit there, outside,
53:38
and on social media, like,
53:40
maybe there's a picture of
53:42
you with somebody, that you
53:44
did not take the picture,
53:46
did not post it, but
53:48
somebody posts it on Instagram. And
53:50
so all that information out
53:52
there essentially is a
53:54
graph that can link back to
53:57
you, even though you try
53:59
really hard to stay hidden.
54:01
That's the
54:03
fundamental problem in terms of
54:05
privacy, security, where you want
54:07
to control access to information,
54:10
but because you have all
54:12
those connections in the
54:14
graph, that makes it really
54:16
hard. And a corollary
54:18
with that, when I talk
54:20
with people in enterprise who are
54:22
doing large -scale knowledge graph practices, the
54:25
one thing that I keep
54:27
hearing over and over again is
54:29
companies using graphs for market
54:31
intelligence, or maybe sometimes you would
54:33
say competitive intelligence. But
54:36
a lot of this might be
54:38
for sales win -back strategies, trying
54:40
to understand who's the competitor that got our bid
54:42
away from us. How can we go back
54:44
and try to... give
54:46
them a better quote. Oh,
54:48
wow. And so I've heard this
54:50
over and over again. We're like, that's
54:53
one of the first graphs that
54:55
starts making a lot of money is
54:57
like literally doing intelligence inside the
54:59
enterprise. Yeah,
55:01
I was going to go down
55:03
that route of like, let's talk
55:05
about a few other cool use
55:08
cases that you have seen, whether
55:10
it's just graphs, or it
55:12
is graph rag, which is
55:14
a hot term these days, you
55:16
know? I
55:19
mean, you know, it's
55:21
interesting. There's a lot
55:23
of graph database vendors, and they really kind
55:25
of lean heavy on the graph query
55:27
side of how to run this. And that's
55:30
something that's very familiar with people in
55:32
data engineering, data science, you
55:34
know, using a query. But I
55:36
think in the graph space, there
55:38
are other areas that aren't query
55:40
first, like using graph algorithms or
55:42
using There's a whole
55:44
other area of what should be
55:46
called statistical relational learning, but you know,
55:48
you've probably heard of like Bayesian
55:50
nets or causality or different areas over
55:53
there of using graphs. But
55:55
then there's also graph neural networks,
55:57
like how can we train deep learning
55:59
models to like understand patterns and
56:01
try to suggest, hey, I'm
56:03
looking at like all the contracts you
56:06
have with your vendors. And
56:08
I noticed that these three here are missing some
56:10
terms. Do you, you know, is that a
56:12
mistake? So I
56:14
think that, you know, there's,
56:16
there's the queries, there's the algorithms,
56:18
there's the causality kind of,
56:20
you know, that
56:25
area of, there's
56:27
also the graph neural networks. There's
56:29
a few other areas too, but these
56:31
are These are all like different camps
56:34
inside of the graph space. They don't
56:36
always necessarily talk with each other, but
56:38
I think it's really fascinating now that
56:40
we're starting to see more and more
56:42
hybrid integrations of them. Yeah.
56:46
I like to point out
56:48
that, fundamentally, graph and table
56:50
are two sides of the
56:52
same coin. As
56:54
a physicist, we
56:56
look at sound, at music,
56:58
both from the frequency domain, like,
57:00
is it A, C, D, E,
57:02
F, what's the frequency distribution,
57:05
and also look at the
57:07
waveform, like the time domain.
57:09
Like, in some situations you
57:11
want to filter, or you
57:13
want to access it more in
57:15
the frequency domain; sometimes it
57:17
makes more sense in the
57:19
waveform domain. It's the same
57:21
data. Like, a graph essentially
57:23
is a giant matrix. If
57:26
you think about the
57:28
large language model neural network,
57:31
it's a graph, but
57:33
it's a gigantic,
57:35
extremely sparse matrix, which
57:38
is a table, right? And
57:40
in fact, because
57:42
it's such a giant sparse
57:44
matrix, it's causing, today, NVIDIA
57:46
to be really hot, because NVIDIA
57:48
has these GPUs that
57:50
can process those matrices. But
57:52
guess what? My
57:55
brain consumes about 19
57:57
watts of energy. The
57:59
GPUs running a large language
58:01
model consume tens of
58:03
thousands of watts of
58:05
energy to meet similar
58:08
computation needs. And
58:10
that's extremely inefficient. Even
58:12
though the computation unit is
58:14
much smaller than my neurons, you would
58:17
think it's supposed to
58:19
compute at a higher efficiency. That's
58:21
precisely because they're dealing with
58:23
extremely sparse matrices. They're not treating
58:25
the neural network as a
58:27
graph. They're treating the neural network as
58:29
a matrix. And that's fundamentally
58:31
the problem for the power efficiency.
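A back-of-the-envelope illustration of the sparsity point: storing a sparsely connected network as a dense matrix pays for every zero entry, while storing it as a graph pays only per edge. Toy numbers, not a measurement of any real model:

```python
# The same network, stored two ways: a dense n-by-n matrix versus a
# graph (edge list). A ring of n nodes has only n edges, so almost
# every dense entry is a zero you still store and multiply.
n = 1000
edges = [(i, (i + 1) % n) for i in range(n)]  # a ring: n edges total

dense_entries = n * n          # 1,000,000 stored numbers as a matrix
graph_entries = len(edges)     # 1,000 stored edges as a graph
print(dense_entries // graph_entries)  # 1000x overhead in this toy case
```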
58:34
So there are certain models
58:36
that come up that really
58:38
deal with AI as a
58:40
graph, that claim several orders of magnitude of savings
58:43
in energy consumption. So
58:45
in real-world applications, the
58:48
one reason why graph hasn't taken
58:50
off as we all thought for the
58:52
past 20 years, like, oh, graph is going to
58:54
take off, graph is going to take off, but
58:56
no, it did not. The
58:58
fundamental problem is because
59:01
we are so familiar with
59:03
all the tools and
59:05
methodologies, like workflows, well
59:07
established in the tabular-based
59:09
way of thinking. It's
59:11
like the Department of Transportation
59:14
does not release the flight
59:17
data as a graph. They
59:19
release it as a table.
59:21
It's easy to access, we
59:23
have all the toolings
59:25
that are mature. To change that
59:27
is extremely difficult. So
59:29
in a way, I would
59:32
argue that AI is
59:34
almost made for graph,
59:36
because AI suddenly allows
59:38
you to process unstructured information,
59:40
like emails, reports, things
59:42
like podcast transcriptions, like
59:44
videos, into a
59:46
structured form that computers can access.
59:49
But guess what? It
59:51
is a graph that AI
59:53
will convert those data into. So
59:56
now you suddenly have this, some
59:58
people argue, I think it's like
1:00:00
80% of the information existing
1:00:03
in unstructured form. Some people argue
1:00:05
that the percentage is even larger. So
1:00:08
AI suddenly makes,
1:00:10
like, the majority
1:00:12
of the information available
1:00:14
for analytic workflows and
1:00:17
assessment. And
1:00:19
the funny thing is, you need
1:00:21
graph to do that. So
1:00:23
in a way, my
1:00:25
assessment is, because of AI,
1:00:27
because of AI, we're
1:00:30
actually entering the boom,
1:00:32
like, the exponential-growth era
1:00:34
of graph, because of
1:00:36
the availability of the data.
1:00:39
It's like the internet of
1:00:41
things. We've been waiting
1:00:43
for it to happen since
1:00:45
2010 or 2005 whenever
1:00:47
and it's always just around
1:00:49
the corner. But now
1:00:52
it does make sense that if you
1:00:54
have all of this unstructured data and
1:00:56
you have these relations, then that sounds
1:00:58
like a graph to me. Yeah.
1:01:01
And going back to
1:01:04
like 1980s era, hard
1:01:06
AI, you know, whether we're
1:01:08
talking about like A star B star
1:01:10
kind of algorithms or talking about planning systems,
1:01:12
all of these were expressed as graphs. And
1:01:15
like, you know, some of the early
1:01:17
thinking that that was like pre -Google
1:01:19
that led to Google, they were talking
1:01:21
about graphs. Some of that
1:01:23
work actually came out of like groupware,
1:01:25
but based on graphs. So it's there. Funny
1:01:28
you say that because we
1:01:30
had one of the talks at
1:01:32
the AI Quality Conference back
1:01:34
last year. was from the
1:01:36
guy who created Docker, Solomon. And
1:01:39
his whole talk was really like,
1:01:41
everything's a graph. If we
1:01:43
really break it down, it's just, it's
1:01:45
all graphs and how one thing relates
1:01:48
to another thing. I'll throw, I'll
1:01:50
throw something else in to kind of
1:01:52
go back to our early part. We
1:01:54
were talking about East meets West. There's
1:01:56
a book, a real
1:01:59
favorite book of mine, from the early days.
1:02:02
This is like going back to the early
1:02:04
90s, but early days of neural networks. About
1:02:07
this idea of like, yeah, there's
1:02:09
some conventions in the West, maybe we
1:02:11
can back off. It's by
1:02:13
a USC professor called Bart Kosko. It's
1:02:15
called Fuzzy Thinking. And
1:02:17
sort of his critique of
1:02:19
science, but more from a lens
1:02:22
of more Eastern perspectives. I
1:02:25
know that this book is like more than
1:02:27
30 years old, but I think that there's
1:02:29
some really great perspectives there that weigh in
1:02:31
a lot, especially what Wei was saying about
1:02:33
like, where are we now with LLMs and
1:02:35
how we're leveraging this in the context of
1:02:37
graphs. So
1:02:40
I think the other thing,
1:02:42
was there anything else that you
1:02:44
guys wanted to talk about
1:02:46
before we jump? I know
1:02:48
there's a lot of cool data
1:02:50
visualization stuff that you're doing, Wei.
1:02:52
Yeah, I just want to add
1:02:54
one thing. I
1:02:57
just want to say
1:02:59
the visualization is not the
1:03:01
end. The
1:03:03
goal is to support analytics.
1:03:07
So I know everybody when it
1:03:09
comes to the graph, talk
1:03:11
about graph visualizations. But
1:03:13
in my mind, what's
1:03:15
really what we need is visual
1:03:17
analytics. How can
1:03:19
we visually transform the information?
1:03:21
How can we visually go
1:03:23
from like information that was
1:03:26
suited for data management, for
1:03:28
data capture, that was so
1:03:30
you can access, work them
1:03:32
step by step towards information
1:03:34
that's suitable for presentation for
1:03:36
answering the specific questions in
1:03:38
that particular domain. So
1:03:41
those steps require a transformation
1:03:43
of data that is not just
1:03:45
like a filter, but
1:03:47
also, fundamentally, a
1:03:49
graph schema mutation. The
1:03:52
schema you have for the
1:03:54
data capturing is not a schema
1:03:56
suitable for presentation. There are
1:03:58
two different things. If
1:04:01
you think about the big data era, the
1:04:04
development of MapReduce
1:04:06
allowed you to
1:04:08
have this step-by-step
1:04:10
flow of information from
1:04:12
the originally captured tabular
1:04:14
format into a very
1:04:17
different table that you
1:04:19
can present. In
1:04:21
graph, it's the same thing
1:04:23
that what graph analytics needs is
1:04:25
a step-by-step, like we
1:04:27
call it a calculus, or operators, to
1:04:29
transform your data from the
1:04:32
form that's been captured to the
1:04:34
form that you want to
1:04:36
present to answer the question. Now
1:04:39
that calculus, it's,
1:04:42
I think it
1:04:44
needs to be in two
1:04:46
forms. It needs to be
1:04:48
in a form where you
1:04:50
can process data in large
1:04:52
quantities, like a large graph
1:04:54
mutating step by step. But
1:04:57
it also needs to be visual. You
1:04:59
need the same set of, a
1:05:01
parallel set of operators
1:05:03
that a data analyst,
1:05:06
but ideally a domain
1:05:08
expert, not a data person, not
1:05:11
somebody who can write
1:05:13
Python or Cypher queries or
1:05:15
GQL. But somebody
1:05:18
with the domain knowledge, can look at it,
1:05:20
because graph is so visual. You're
1:05:22
like, hey, I want to
1:05:24
simplify this. Oh, I know
1:05:26
Paco and Wei have so
1:05:28
many meeting points. Let's abstract
1:05:30
that out. Let's just create a
1:05:33
single relationship that we infer,
1:05:35
like Wei and Paco, that they
1:05:37
know each other, and get
1:05:39
rid of all the other information. So
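That distillation operator — collapsing the many shared meeting points into a single inferred "knows" edge — might be sketched like this; the relationship labels and data are invented for illustration:

```python
# A detailed multi-domain graph: people linked to shared context nodes.
detailed = [("wei", "attended", "conf_a"), ("paco", "attended", "conf_a"),
            ("wei", "attended", "podcast"), ("paco", "attended", "podcast"),
            ("wei", "likes", "coffee")]

def abstract_knows(edges):
    """Replace shared-context paths with one concise 'knows' edge,
    discarding all the other detail."""
    attendees = {}
    for s, p, t in edges:
        if p == "attended":
            attendees.setdefault(t, set()).add(s)
    knows = set()
    for people in attendees.values():
        for a in sorted(people):
            for b in sorted(people):
                if a < b:
                    knows.add((a, "knows", b))
    return sorted(knows)

print(abstract_knows(detailed))  # [('paco', 'knows', 'wei')]
```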
1:05:42
this might all say, hey, Paco
1:05:44
knows a million people. Maybe I
1:05:46
underestimated Paco a little bit.
1:05:48
So sorry about that. No
1:05:50
kidding. You probably know more than that. But
1:05:52
from the graph, we can quickly compute
1:05:54
this number and put it on Paco,
1:05:56
make Paco very, very big, because Paco knows
1:05:58
a million people. So
1:06:01
that kind of operation is
1:06:03
highly intuitive. So I
1:06:05
want to stress this. The
1:06:07
visualization for graph is not the
1:06:09
end. The visualization for graph
1:06:11
is a tool you use to transform
1:06:13
the graph to get you the answer.
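A minimal sketch, in plain Python with made-up names and edges, of the kind of operator pipeline Wei describes: collapse the many raw "meeting" edges between two people into one abstracted "knows" relationship, then compute each node's connection count so the visualization can size it.

```python
from collections import defaultdict

# Raw capture: one edge per observed meeting (many parallel edges).
# All names and edges here are purely illustrative.
meetings = [
    ("Wei", "Paco"), ("Wei", "Paco"), ("Wei", "Paco"),
    ("Paco", "Demetrios"), ("Wei", "Demetrios"),
]

# Operator 1: abstraction. Collapse parallel edges into a single
# "knows" edge, keeping the meeting count as an edge weight.
knows = defaultdict(int)
for a, b in meetings:
    knows[frozenset((a, b))] += 1

# Operator 2: aggregation. Count each node's connections in the
# simplified graph; the visualization can map this to node size.
degree = defaultdict(int)
for pair in knows:
    for person in pair:
        degree[person] += 1

print(dict(degree))
```

Each step is a small, composable mutation of the graph, in the same spirit as a MapReduce stage over a table.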
1:06:15
That's a waypoint. Very
1:06:17
good. Yeah, that
1:06:20
is very in line with
1:06:22
what you were saying earlier
1:06:24
on how when you don't
1:06:26
know the question, that's sometimes
1:06:28
the hardest part. And so
1:06:30
being able to wrestle with
1:06:33
the data in different forms,
1:06:35
one being visualizing it
1:06:37
in different ways. That's one
1:06:39
tool to hopefully help you
1:06:41
get to the answer or
1:06:43
first step, the question, which
1:06:45
can then lead to the
1:06:47
answer you're looking for. Yeah.
1:06:50
And to mutate the graph
1:06:52
visually. So you
1:06:54
can start poking it.
1:06:57
Yeah. Yeah, exactly.
1:06:59
It does feel
1:07:01
like the ability to
1:07:04
just mutate
1:07:07
the graph is such a
1:07:09
strong tool. Because of
1:07:11
all these different reasons that
1:07:13
we had mentioned when it comes
1:07:15
to the depth and the
1:07:18
way that you're able to look
1:07:20
at the domains or you're
1:07:22
able to just find anomalies or
1:07:24
find different data quality issues,
1:07:26
whatever it may be, whatever your
1:07:28
use case is, it's very
1:07:30
cool. It does sound instinctively
1:07:33
a bit manual though, right? So
1:07:37
far I think Wei has
1:07:39
brilliant examples of what they're doing, like
1:07:41
with SightXR, of leveraging
1:07:43
3D visualizations, zoom in, zoom out,
1:07:45
in conjunction with algorithmic ways,
1:07:47
using graph algorithms to sort of
1:07:50
focus the lens, focus the
1:07:52
searchlight. I think that more
1:07:54
can be automated over time,
1:07:56
and maybe this is where agents
1:07:58
come in, actually helping
1:08:00
determine how to be
1:08:03
the cinematographer there on the
1:08:05
graph. Yeah. So there's definitely a
1:08:07
way of helping you to
1:08:09
look at perspectives. And
1:08:11
very often we deal with the data
1:08:13
that's of a graph-connected nature, but it's
1:08:15
also dimensional. Each
1:08:18
node has so many properties.
1:08:20
Each property is a dimension. So
1:08:22
it's high-dimensional information. So
1:08:25
which dimension set do you want
1:08:27
to take in combination with the
1:08:29
network information to help you to
1:08:31
see? You need to have a
1:08:34
versatile, flexible way of choosing
1:08:36
the dimension set. Very
1:08:38
often, when you shift from
1:08:40
one dimension to another dimension,
1:08:42
you reveal some flocking of things
1:08:44
going together, some clustering that is
1:08:47
happening. It really says, hey, those
1:08:49
things always move in the same
1:08:51
direction. So those signals
1:08:53
help you to formulate a
1:08:55
lot of ideas, instincts from
1:08:57
the data. And then when
1:08:59
you see that information, the next thing you want
1:09:01
to know, hey, I want to capture that as
1:09:03
a feature. Now,
1:09:05
can you represent that as
1:09:07
a feature, so that
1:09:10
what you see becomes
1:09:12
a thing, becomes
1:09:14
an entity in your visualization
1:09:16
that you can put
1:09:18
back in there? That
1:09:20
is visual
1:09:22
analytics. Whoa.
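A small hypothetical sketch of that loop in plain Python: nodes carry many property dimensions; once a chosen dimension pair reveals a grouping, the group label is written back onto each node so it becomes a stored feature. The property names and threshold below are invented for illustration.

```python
# Hypothetical nodes, each with several property dimensions.
nodes = {
    "a": {"spend": 10, "visits": 2,  "region": "EU"},
    "b": {"spend": 12, "visits": 3,  "region": "US"},
    "c": {"spend": 90, "visits": 40, "region": "EU"},
    "d": {"spend": 95, "visits": 42, "region": "US"},
}

# Chosen dimension set: (spend, visits). A simple threshold stands in
# for whatever grouping the analyst spotted visually.
def cluster_label(props):
    return "heavy" if props["spend"] > 50 and props["visits"] > 20 else "light"

# Feed the observation back: the visual grouping becomes a first-class
# feature stored on every node, usable in later tabular or graph steps.
for props in nodes.values():
    props["cluster"] = cluster_label(props)

print({name: p["cluster"] for name, p in nodes.items()})
```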
1:09:26
So capturing it as a feature
1:09:28
and then you can feed it
1:09:30
into the tabular data in a
1:09:32
way. Yes, exactly. Guys,
1:09:35
this is awesome. Is there
1:09:37
anything else that you want to hit on before
1:09:39
we stop? I feel like I've learned a
1:09:41
ton just from talking to you. I knew it
1:09:43
was going to be great conversation. I was
1:09:45
hanging on to my seat this whole time. It's
1:09:47
like, oh my God, I'm learning so much. I learned
1:09:49
a lot too. Yeah. In
1:09:51
terms of cross-domain, I
1:09:54
want to show one funny
1:09:56
example of how difficult cross-domain
1:09:58
is. So in this
1:10:00
example, it's an extreme
1:10:02
cross-domain. So I organized
1:10:05
a kind of tech, arts,
1:10:07
dance, and science, like,
1:10:10
nonprofit. So
1:10:13
one thing we do every
1:10:15
week, every Wednesday, we bring people
1:10:17
in the engineering and science domain
1:10:19
and people in the dance, art,
1:10:21
music domain together, and we
1:10:23
explore something together and have
1:10:25
a conversation. The very first
1:10:27
meeting, when we brought
1:10:29
people together, that happened about
1:10:31
11 years ago. We
1:10:34
had about 20 people sitting
1:10:36
in the room, everybody in a very
1:10:38
vibrant conversation. And then
1:10:40
all of a sudden I realized something:
1:10:42
it is true that
1:10:44
everybody speaks English, but nobody
1:10:47
can understand each other. Because
1:10:51
they're using similar
1:10:54
vocabularies. But because of
1:10:56
domain, just like Paco
1:10:58
talked about earlier in the enterprise
1:11:00
setting, because of the domain
1:11:02
difference, they mean
1:11:04
totally different things. When a
1:11:06
physicist talks about energy, we have
1:11:09
very concrete things that we call
1:11:11
energy. What a
1:11:13
dancer calls energy is
1:11:15
a very different kind
1:11:17
of energy. When
1:11:19
the computer people talk
1:11:22
about Python, we're not
1:11:24
talking about a snake.
1:11:27
But the dancer, when they hear Python, they're like,
1:11:29
why are you bringing a snake to the
1:11:31
conversation? So
1:11:35
I think, just to
1:11:39
echo what Paco said
1:11:41
earlier in the enterprise data
1:11:43
context: that
1:11:45
domain is very,
1:11:48
very important. You need to be aware of
1:11:50
the domain, knowing the limits
1:11:52
of the domain and how to
1:11:54
find a way to cross domains. For
1:11:54
us, it's generally a lot of conversation.
1:11:56
I think it's a human problem.
1:11:58
It's not a technical problem. Technology
1:12:01
can help, but can only do
1:12:03
so much. We
1:12:06
had a conversation on
1:12:08
here a few months ago
1:12:10
with folks who had
1:12:12
created a data analyst agent.
1:12:14
And they said one
1:12:16
of the hardest parts for
1:12:18
the success of this
1:12:21
agent was to first create
1:12:23
a glossary of business
1:12:25
terms so that the agent could
1:12:27
understand, really trying to
1:12:29
nail down these fuzzy words
1:12:31
and these words that
1:12:33
maybe for one person they
1:12:36
mean one thing and for
1:12:38
another person they mean another
1:12:40
thing and the quintessential
1:12:42
example of this is an
1:12:44
MQL. When you're at one
1:12:46
company, or when
1:12:48
you're on one team, an
1:12:50
MQL is one thing, and
1:12:52
when you go to another
1:12:54
team, an MQL is another
1:12:57
thing. They all mean marketing
1:12:59
qualified lead. But when does
1:13:01
that person become a marketing
1:13:03
qualified lead? What do they
1:13:05
have to have done or
1:13:07
what stage are they in?
1:13:09
And so the agents may
1:13:11
understand, and the LLMs understand,
1:13:13
what an MQL is, kind
1:13:15
of, but you really have
1:13:17
to flesh out this glossary
1:13:19
to let them know all
1:13:22
of these different terms that
1:13:24
you use and that are
1:13:26
in your database. So
1:13:28
when the agent needs to go
1:13:30
and pull, how many MQLs did
1:13:32
we have last week? It
1:13:34
understands what that means. Yeah,
1:13:37
that's your semantic layer right
1:13:39
there. That's a controlled
1:13:41
vocabulary. You put enough of
1:13:43
these together and you get your ontology.
1:13:46
Yeah, yeah, yeah, exactly.
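One way to picture such a glossary, sketched in plain Python with invented field names and definitions: each fuzzy business term maps to a concrete predicate, so an agent answering "how many MQLs did we have?" resolves the term the way this particular team defines it.

```python
# Controlled vocabulary: term -> concrete predicate over a record.
# The definition below (and the lead records) are illustrative
# assumptions, not a real team's definition of an MQL.
glossary = {
    # On this hypothetical team, an MQL is any lead who
    # downloaded a whitepaper.
    "MQL": lambda lead: lead.get("downloaded_whitepaper", False),
}

leads = [
    {"name": "lead-1", "downloaded_whitepaper": True},
    {"name": "lead-2", "downloaded_whitepaper": False},
]

def count_term(term, records):
    """Answer 'how many <term>s did we have?' via the glossary."""
    predicate = glossary[term]
    return sum(1 for r in records if predicate(r))

print(count_term("MQL", leads))  # 1 under this team's definition
```

Another team could swap in a different predicate for "MQL" without changing the agent's query logic, which is the semantic-layer idea Paco points at.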