Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
So I'm Wei. My full name
0:02
is actually Wei Dong Yang, but Wei
0:04
is easier to pronounce. And
0:06
I'm CEO of Kineviz, a
0:08
visual data
0:10
analytics company. And
0:12
I love coffee.
0:14
I think civilization starts with
0:16
the invention of coffee. So I have to
0:19
drink a coffee. I do
0:21
add milk to coffee because the black
0:23
coffee is a little bit too strong
0:25
for me. Welcome back
0:27
to another MLOps Community
0:29
podcast. Today we are lucky enough
0:31
to have not one but two
0:33
graph experts who have been doing this
0:35
for a very long time. I
0:37
got schooled. I felt like I
0:39
learned a ton about how to
0:41
use graphs as tools and ways
0:43
that we can leverage them better. Let's
0:45
get into this conversation with Paco
0:47
and Wei. As always, I'm your
0:49
host, Demetrios. And you know what
0:52
is a huge help? If you
0:54
can hit a little review on
0:56
whatever you are listening to this on, that
0:58
would mean the world to me. Boom,
1:00
let's jump into it. And oh
1:02
yeah, if you are one
1:04
of those people that is listening
1:06
on a podcast player, I
1:08
have got the recommendation for
1:11
you. Our
1:13
music recommendation, this
1:15
is thanks to
1:17
one of the people
1:20
in the community, Lee Wells, who
1:22
just joined and now whenever someone
1:24
joins the community, I ask them
1:26
what their favorite music is. Today,
1:29
we're listening to We Are One by Maze.
2:30
We're talking about PII and
2:32
using different methods to anonymize
2:35
data, right? And Paco, you
2:37
had said something that I
2:39
didn't fully understand, and then
2:41
Wei, you said something else that I didn't fully
2:43
understand, so maybe we can rehash that and I
2:45
can understand it the second time. Awesome.
2:50
Well, I was going to ask if you
2:52
all ever came across, there's another podcast
2:54
that I followed called The Dark Money Files,
2:57
and there's
3:00
a couple of consultants who have worked in
3:02
banks and understand a lot of the
3:04
ins and outs of financial crimes
3:07
and investigations. And
3:10
so I was just gonna preface it because
3:12
they've had a great series recently. If
3:14
you've ever heard of this thing called the
3:16
SAR, it's a suspicious activity report. And
3:18
the laws are really weird depending on
3:21
what country the bank is in. But
3:23
basically this, if you're at a
3:25
bank and you see some suspicious activity, like
3:27
there's a money transfer, and the
3:29
counter party is like a known
3:31
terrorist group or something, you see
3:33
something weird going on. Okay, number
3:35
one, you have an obligation to
3:37
report a crime to a criminal
3:39
investigation unit. If
3:41
you see something suspicious and you don't report it,
3:43
that's a crime. If
3:45
you see something suspicious,
3:48
you have not an obligation,
3:50
but a responsibility to send it up
3:52
the chain so that other financial houses
3:54
might share. If
3:57
you send too much information, you might
3:59
get sued. And then
4:01
so there are these reports and it
4:03
usually costs on average about $50,000 to
4:05
process each report. So you don't want to
4:07
generate too many of them. And like
4:09
machine learning models could generate thousands per day,
4:12
which would be like, you know, tens
4:14
of millions of dollars of liability. So this
4:16
whole space of like, what do I
4:18
do? I'm getting, I'm getting attacked. And what
4:20
do I do? Because I mean, also
4:22
these people are taking money and you might
4:24
have to, under some
4:26
situations as a bank, you might
4:28
have to compensate if there
4:30
is some kind of scam. So
4:33
you could be losing money and
4:35
facing like legal threats from three sides.
4:38
And meanwhile, there's this thing called
4:40
a SAR. And like, I've actually been
4:42
yelled at for asking what I was supposed to
4:44
integrate with something. And I was like, can I see
4:46
what the schema is? Like, no, you're not allowed
4:48
to, no, it's too confidential. So it's like, it's
4:51
just this whole tangle of worms about
4:53
how to handle it. What
4:56
do you actually do once
4:58
you have evidence of financial
5:00
crime or even suspicion of it?
5:03
What next steps you take are really
5:05
tangled. And I think, Wei Dong,
5:07
you probably have a lot more experience
5:09
about this in certain theaters too. I
5:13
have some similar experiences where
5:15
even the schema is
5:17
not allowed to be seen because
5:19
the schema may actually
5:21
reveal some... secrets or certain
5:23
activities may become liable to
5:25
certain parties. So that
5:28
can be pretty tricky. And
5:30
so it basically gives
5:32
away information that if you
5:34
were looking at it, you now, because
5:37
you know the schema, you can
5:39
guess a few other parts of
5:41
this puzzle and get information that
5:43
people don't want out there. The
5:45
banks are using a lot of
5:47
data that come from providers. There
5:50
may be other cases where
5:52
there's data that's coming from,
5:54
say, public sector agencies, crime
5:56
investigations. There may be
5:58
intelligence reports, and so there may be
6:00
parts of the schema that are highly sensitive
6:02
and only certain people are allowed to see.
6:05
But you were saying
6:07
that with graphs...
6:10
anonymizing that PII, you're
6:12
still able to gather insights,
6:14
right? Yeah, that was
6:16
cool. We were just in a talk
6:18
and Brad Corey from Nice Actimize
6:20
was showing where like they're preparing to
6:22
do RAG and they were using
6:24
I think Bedrock and they know that
6:26
they've got a hot potato. They
6:29
know they've got a lot of customer
6:31
PII that just can't go outside
6:33
the bank. So what they
6:35
were doing is substituting PII with
6:37
unique identifiers, tokens they
6:39
generate on the fly, and then
6:41
they make the round trip after
6:43
they've run three LLMs and made
6:45
a summary, and they replace
6:47
the tokens with the highly
6:50
confidential material they just
6:52
have internally. And so this
6:54
is a way of being able to use
6:56
some sort of external AI resources, but
6:59
still manage a lot
7:01
of data privacy. Yeah,
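The round-trip pattern described here can be sketched in a few lines. This is a minimal illustration, not NICE Actimize's actual implementation: the regex-based detector, the token format, and the `redact`/`restore` function names are all assumptions made for the example.

```python
import re
import uuid

# Sketch of the PII token round trip: swap PII for opaque tokens before
# calling an external LLM, then swap the real values back in afterwards.
# The SSN-only detector below is a deliberate simplification.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace each SSN with a freshly generated token; keep the mapping internal."""
    mapping: dict[str, str] = {}
    def _swap(match: re.Match) -> str:
        token = f"<PII_{uuid.uuid4().hex[:8]}>"
        mapping[token] = match.group(0)
        return token
    return SSN_RE.sub(_swap, text), mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Round trip: put the confidential values back into the model's output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe, mapping = redact("Customer SSN 123-45-6789 flagged a transfer.")
# `safe` contains no SSN, so it could be sent to an external model;
# the summary that comes back is re-hydrated locally.
assert "123-45-6789" not in safe
assert restore(safe, mapping) == "Customer SSN 123-45-6789 flagged a transfer."
```

The key property is that the token-to-value mapping never leaves the bank's infrastructure; only the redacted text makes the round trip.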
7:06
I've seen it with, we had
7:08
these folks on here from Tonic
7:10
AI, and they were talking about
7:12
how they would use basically the
7:14
same information but swapping it
7:16
out. So if it is
7:18
someone's name, they just changed
7:20
the name, so it went
7:22
from Paco to John. And if
7:24
it is a social security number,
7:27
they would swap out the social
7:29
security number and totally randomize the
7:31
number. But it still is
7:33
a social security number. So you,
7:35
at the end of the day, you
7:38
get almost like this double blind.
7:41
So even if you're a data scientist
7:43
looking at the information, you can understand
7:45
it. But you don't
7:47
know if it is the
7:49
true information that's going to reveal
7:51
that PII. Interesting.
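That swap-but-keep-the-shape idea can be sketched with the standard library alone. This is a toy illustration, not Tonic's product: the fake-name list, the hash-based mapping, and the SSN format are all assumptions for the example.

```python
import hashlib
import random

# Toy format-preserving substitution: a real name always maps to the same
# fake name, and an SSN becomes random digits that still look like an SSN.
FAKE_NAMES = ["John", "Maria", "Chen", "Fatima", "Lars"]

def fake_name(real_name: str) -> str:
    """Deterministic: the same real name always maps to the same fake name."""
    digest = hashlib.sha256(real_name.encode()).hexdigest()
    return FAKE_NAMES[int(digest, 16) % len(FAKE_NAMES)]

def fake_ssn(rng: random.Random) -> str:
    """Random digits, but the result still looks like a social security number."""
    return f"{rng.randint(100, 899):03d}-{rng.randint(10, 99):02d}-{rng.randint(1000, 9999):04d}"

rng = random.Random(42)  # seeded for reproducibility
print(fake_name("Paco"), fake_ssn(rng))
# A data scientist still sees a name and an SSN-shaped value, but neither
# reveals the underlying PII.
```

Consistency matters for analysis: because the same input always maps to the same fake value, joins and aggregations over the anonymized data still behave like the original.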
7:55
Interesting, yeah. Although
7:58
I do see situations where even
8:02
the structure of the document itself,
8:04
if it gets revealed, reveals
8:06
information that you do not
8:08
want people to know. Like
8:10
in
8:13
the investigation space, very often
8:15
you do not want the people
8:17
being investigated to know that they're being
8:19
investigated. But certainly information, even
8:21
the structure of
8:23
the document being reviewed, can
8:25
become a problem. So
8:28
at some point I
8:30
felt like the
8:32
in-house, on-prem
8:34
LLM might
8:36
be necessary. Especially, I
8:39
just read news
8:41
that the M3
8:43
Ultra Studio with
8:45
the 500GB RAM
8:47
can run large
8:49
language models at 20
8:51
tokens per second,
8:53
that could potentially
8:55
be an interesting
8:58
solution for that. Yeah,
9:00
I mean, for our end
9:03
use cases, you know, like 60%
9:05
of those are air-gapped, and
9:07
so, yeah, you know, the largest
9:09
chunk of that, they're gonna
9:11
be a lot of like
9:14
public sector agencies running in SCIFs.
9:16
So they can't do any
9:18
data out. Yeah. And
9:21
there's good news for running
9:23
really interesting LLMs on local
9:25
hardware. There's a lot of
9:28
really good news. I will
9:30
shout out to my friends
9:32
over at Useful Sensors, Pete
9:32
Warden and company, I'll put that
9:37
in the chat. You
9:39
can do a lot with local hardware.
9:42
What are they doing? Useful
9:45
Sensors, so Pete
9:47
Warden and
9:50
Manjunath Kudlur, they were part
9:53
of the TensorFlow team at
9:55
Google. And for
9:57
I think like eight years, they
9:59
evangelized use of deep learning inside
10:01
of products at Google, like internally. And
10:04
then they left, and the team has
10:06
a startup in Mountain View now. And what
10:08
they're showing is, hey,
10:10
here's like $50 worth of hardware. Here's
10:13
an ARM chip with a neural network
10:16
accelerator on it. And we can run
10:18
LLMs on battery power. So
10:20
it's pretty cool because they
10:22
came out of like the tiny
10:24
ML, I don't know if you've ever seen the
10:26
conference. Oh, yeah. And
10:29
so. You know,
10:31
this is a lot of the
10:33
specialty that Pete has. And,
10:35
you know, he
10:37
was on the CUDA
10:39
team at NVIDIA before. So,
10:42
I mean, these folks really
10:45
know how to make AI infrastructure
10:47
run on hardware, and particularly
10:49
how to handle a lot of
10:51
low power and low latency
10:53
kinds of situations, and
10:55
where to punch through the
10:57
bottlenecks. You don't necessarily have
10:59
to have a ginormous GPU
11:01
cluster, although in some cases
11:03
it helps. But especially when you're
11:05
running inference, you can be running on much
11:08
lower power and doing really interesting things out
11:10
in the field. So wild
11:12
now, I know that
11:14
we had originally wanted
11:16
to chat a bit
11:18
about this idea that
11:20
I think, Wei, you
11:22
had proposed, and it's a
11:24
little bit of
11:26
a differentiation on GraphRAG
11:28
and so maybe you
11:30
can set the scene for
11:33
us because yeah, I
11:35
want to go deeper
11:37
there. Yeah, I
11:39
run the danger
11:41
of going way too
11:44
far. Fundamentally,
11:48
I think with LLM,
11:51
the way machines process information has
11:53
changed. Before
11:55
LLM, everything is
11:57
exact, symbolic, like
12:00
matching all the APIs, all
12:02
the rigid data
12:04
structures. Just think about
12:08
Deep Blue, when it beat the
12:10
chess champion, everything is rigid
12:12
knowledge as rules and
12:14
things. LLMs
12:17
changed everything because LLMs started
12:19
to understand things on a
12:21
contextual basis, started to
12:23
understand fuzzy things. And
12:25
it suffers the same
12:27
weakness of a human
12:29
being, not exact, like
12:31
we glide over information,
12:33
we draw conclusions, we
12:35
make leaps, make
12:37
jumps. But at
12:39
the same time, LLM's
12:42
ability to reason like
12:44
humans, that for me
12:46
has fundamentally changed how
12:48
we approach computing. And
12:52
so in
12:54
applying LLM to
12:56
analyze documents, my
13:00
analysis is now we
13:02
can let LLM work more
13:04
like humans rather than
13:06
like the machines we understood in
13:08
the past. That also
13:11
implies what data structure
13:13
is preferred for LLMs, which
13:15
I would argue is
13:18
a data structure, a data
13:20
management, that preserves as
13:22
much contextual information as possible,
13:24
preserves as much nuance
13:26
as possible, because the
13:29
subtle nuances may turn out
13:31
to be important. So
13:34
I use the example
13:36
of, my wife is Brazilian.
13:38
The American tourist to Brazil
13:40
gets invited to a house
13:42
party, is told the party starts
13:44
at 6 p.m. So,
13:46
as a good American
13:49
guy, he shows up promptly on
13:51
time at 6 p.m. And
13:53
the hostess comes out still
13:55
wrapped in the shower
13:57
towel, totally confused.
13:59
Right? And it turns out
14:01
over there, 6 p.m.
14:03
is when the hostess
14:05
starts thinking about the party,
14:07
starts going out shopping,
14:09
preparing food and getting ready.
14:12
And the people usually don't show up until like
14:14
two or three hours later.
14:16
It's a big culture difference. Yeah,
14:19
if we try to capture
14:21
that in a knowledge graph, what
14:24
kind of construct allows us
14:26
to capture those subtle cultural
14:28
nuances there? And that might
14:30
become important in understanding the
14:32
document later. So I think
14:34
that's the challenge. Yeah. Paco,
14:37
you want to add something there? Let's
14:39
hear what you think. Well,
14:42
from a perspective of natural
14:44
language, something that the models
14:46
bring in, but it's kind of a
14:48
nuance and I don't think it's talked
14:51
about a lot. There's a very recursive
14:53
nature to how we as people talk
14:55
with each other and tell stories and
14:57
share information. We do reference
14:59
it in the sense of like going down
15:01
the rabbit hole. Like if you follow
15:03
a thread too far, you're kind of going
15:05
down the rabbit hole. And there's this very recursive
15:07
nature of how we think and especially how
15:10
we express. It certainly comes across
15:12
in written language, although we tend
15:14
to think of written language as
15:16
something linear. There's paragraphs and sentences,
15:18
and it can all be diagrammed.
15:20
But when you look at the
15:22
actual references that are inside of
15:24
those sentences, they're making recursive calls
15:26
throughout a story, throughout
15:28
somebody's speech or throughout
15:30
a book. And
15:33
we can try to linearize that and
15:35
come up with an index or a bibliography,
15:37
but at the end of the day, it's
15:39
a graph. And you get this
15:41
very self -referential thing in any text. And
15:43
this is something that the LLMs have
15:45
really, I think, pulled out. And
15:48
we were also just part
15:50
of the talk we were
15:52
in, also Tom Smoker from
15:54
WhyHow was showing
15:57
how they leverage ontology, they
15:59
leverage schema, and chase after
16:01
information recursively. So
16:04
that's just another kind of view
16:06
on this. But, Wei,
16:09
I love how you all
16:11
are approaching this. You have a
16:13
very powerful view of kind
16:15
of relaxing the constraints upfront, but
16:17
then having the context propagated
16:19
through. I realized there's an important
16:21
philosophical approach difference between East
16:23
and the West. And
16:26
the Eastern philosophy very much drive
16:28
towards the nature of things. And
16:30
it's important, which is that
16:32
that's very curiosity about nature of
16:34
things, that the desire to
16:36
have a definitive definition of nature
16:39
of something is led to
16:41
the great scientific discovery over the
16:43
past several hundred years. The
16:46
Eastern philosophy, very much on
16:48
the other side, is focused on the
16:50
contextual, focused on shifting, changing
16:52
nature of things. Like
16:54
the Chinese bible, the Daoist
16:56
bible, the Dao De Jing, in the
16:58
first verse it says Dao Ke
17:00
Dao, Fei Chang Dao, which means
17:02
if you name something, you get
17:04
it wrong. Or it's
17:06
not permanent. It's
17:08
really focused on impermanence of things.
17:10
It focuses on how everything changes
17:12
nature in context with other things.
17:15
So that is essentially a
17:17
graph. Now,
17:19
you're putting both things together.
17:22
So, okay, I have to
17:24
say that that attitude towards
17:26
like, oh, everything changes. Thus,
17:28
we cannot see anything. Thus,
17:30
everything is fuzzy, very
17:32
much contributed to why Chinese
17:34
technology and science developed very far
17:37
until about a thousand years
17:39
ago and then stalled. And
17:41
a lot of it is attributed to this
17:43
philosophical
17:45
attitude, which reduced a lot
17:47
of the curiosity and drive to go down deeper
17:49
into the nature of things.
17:52
However, in practical things, there's
17:54
some practical application of that
17:56
approach, which today with
17:58
LLM and graph, we really
18:01
see that it's like a
18:03
great combination of you allow
18:05
certain things to be drilled
18:07
down to be very definitively
18:09
defined, to be clearly defined
18:11
within the context. But
18:13
a lot of information,
18:15
contextual information, stays fuzzy.
18:19
So in fact, I feel
18:21
like I'm really excited about
18:23
integrating sensing and our graphs
18:25
as a kind of
18:27
solution together, because the sensing
18:30
helps to drive this definitive
18:32
part. Once you have
18:34
the definitive part, drilled
18:36
down, named, defined, it really
18:38
speeds up making a
18:40
lot of assessments fast, definitive,
18:43
and precise, which is crucially
18:45
important. But on the
18:47
other hand, you allow
18:49
this loose structure of information
18:51
decomposed as a graph
18:53
that you can easily retrieve.
18:56
and without losing the nuances, the
18:58
subtleties, like in
19:00
the cultural differences, you
19:03
still preserve that. So the two
19:05
things come together. My feeling is
19:07
one is how you
19:09
want to ground the LLM properly
19:11
to create a precise, accurate answer, and
19:13
know the limit, know when it
19:16
does not know, not to make
19:18
a judgment. I think that's also
19:20
very, very important. So in my
19:22
mind, the graph and AI
19:24
right now present an opportunity to
19:26
allow this Western way of driving
19:28
to the nature of things and the
19:31
Eastern way of focusing on the
19:33
contextual information to come together to work
19:35
together to solve practical problems. So,
19:37
very well said. And you
19:39
know, the challenge we face
19:41
is we don't really know what
19:43
the downstream application will be Like
19:46
we're doing investigation. We're doing some
19:48
kind of discovery whether you're trying
19:50
to find, you know, money launderers
19:52
or whether you're trying to find
19:54
you know, who's my best customer
19:56
for this hotel? It's a discovery
19:59
process and by nature of discovery
20:01
You don't know what the answers
20:03
are in fact in a complex
20:05
system. You don't even know where
20:07
or how just you know, it's
20:09
unknown unknowns, right? So by
20:11
preserving that context then you are
20:14
sort of fortifying yourself so
20:16
that when the time presents itself, you'll
20:18
be able to make the
20:20
right discoveries. You won't
20:22
have cut them off in advance. I
20:24
think if you go back to
20:26
before relational databases came out, you
20:28
go back to some of the
20:30
earlier writings from Ted Codd,
20:32
and one of his colleagues was
20:35
William Kent, who did... a book
20:37
called Data and Reality. If
20:39
you go back to some of
20:41
the early like 1970s thinking about data
20:43
management, it's really interesting to see
20:45
where the lines are drawn because in
20:47
this Western view, so much
20:49
of data management was about, let's
20:52
have a data warehouse, let's
20:54
pretty much throw away the relationships, let's
20:56
focus on the facts. We
20:58
have a lot of, as we were saying,
21:00
a very Western view of like, I
21:02
just want to know like millions of facts
21:04
and I will piece them together with
21:06
a query. I'm not, yeah, I'm not really
21:08
interested in preserving the context. So, I
21:10
mean, I think we have a long history
21:12
from like data warehousing of going too
21:14
far on the Western side. Well,
21:18
what is interesting to me
21:20
is the conversation that we
21:22
had with Robert Caulk on
21:24
here probably three months ago,
21:26
and how he said, we've
21:29
completely thrown out ontologies. And
21:31
for his specific use case, that
21:33
isn't the way that they wanted
21:35
to go. And I
21:37
wonder if you guys have thought
21:40
through that and what that looks
21:42
like, what the benefits are, and
21:45
is it one of these
21:47
things where you potentially are experimenting
21:49
on those levels too? In
21:51
my perspective, ontology is important, but
21:54
you have to know the boundaries. I
21:57
give a parallel to
21:59
theories in physics, like Newton's
22:01
law. Newton's law is
22:03
important. It captures important truth
22:05
in nature. However,
22:09
just like any physicist,
22:11
any physicist's theories, the
22:13
moment the theory is proposed, there's
22:16
a very important concept:
22:18
it is waiting to be disproved.
22:21
So you never accept it as
22:23
the truth of everything. You
22:25
have a theory. Paco
22:27
was a math scientist, so I think
22:29
he's also very familiar with the
22:31
concept. When you propose a theory, it may
22:33
test true, but you're always
22:35
looking for situations, looking for the
22:37
boundaries where the theory will stop
22:39
being true. So I
22:41
don't think ontology is
22:44
anything different. It's just like
22:46
ontology needs to be
22:48
very well-grounded. The
22:50
context needs to be defined.
22:52
And within this context,
22:54
this ontology knowledge
22:57
is real. It's truth.
23:00
The problem I see with a
23:02
lot of traditional knowledge graph
23:04
approaches is people ignore the fact
23:06
that ontology has to be
23:08
confined within a specific domain. The
23:11
moment you step out of the domain,
23:13
you have a problem. But
23:15
the other thing is, we think
23:17
this domain ontology is fantastic. It
23:20
helps you to solve problems
23:22
so much faster, so much more precisely.
23:25
But again, as long as you
23:28
can define the boundaries, define the
23:30
domains, it's great. What
23:35
Robert Caulk and Elin Törnquist
23:37
and others at AskNews,
23:39
what they're doing is they're
23:41
looking at news sources, especially
23:43
regional news sources across the
23:46
world, and they
23:48
really are finding hard
23:50
evidence, groundbreaking
23:52
evidence on the ground, literally,
23:54
if you're doing ESG work, and
23:57
you're trying to do diligence on a
23:59
company or a set of suppliers and
24:01
you want to find out like what
24:03
are their operations really like over in
24:05
that other country where they're based and
24:07
then you find out they're engaged in
24:09
like I don't know child labor or
24:11
something and you know you you want
24:13
to make other arrangements before your shareholders
24:15
find out um so I think with
24:17
Ask News you know they're out and
24:19
they're looking they're working with those publishers
24:22
and they're they're collecting that news and
24:24
representing it in a graph And
24:27
yeah, as
24:29
you were saying, I mean, an
24:31
ontology, ontologies really don't work across
24:33
domains. You really want to focus
24:36
more on like closed world within
24:38
a domain, having a
24:40
full enterprise wide ontology, nice
24:42
idea, but I rarely see it work. And
24:45
I think that in the case
24:47
of like understanding news reports in
24:49
the world, you don't know what
24:51
the domain is in advance. You
24:53
only know this is what is being
24:55
published. And so I think
24:57
by relaxing that constraint at Ask News,
25:00
they're able to come up with a
25:02
graph of like, here are things that
25:04
are related. You can follow this evidence
25:06
and you can find more historically about
25:08
this area. I
25:10
think those are very important, but
25:12
ultimately it will be shaped
25:14
by some kind of context, some
25:16
type of shared definitions. And
25:19
ontology is really more about sharing definitions
25:21
and making sure we're you know, describing the
25:23
same thing because I swear, you go
25:25
to a big company, use the word customer
25:27
in front of one VP, you
25:29
know, in sales, it means something different
25:31
to like the VP in charge of
25:33
procurement. So even like the
25:35
words themselves don't cross domains. The
25:38
graph is basically our idea that we
25:40
know that there's connections. Like if
25:43
you do have your operations data,
25:45
but then you also have your like
25:47
sales data, you know, there's some connections
25:49
across there. It's not exactly the same, but
25:51
some stuff is connecting, so graphs show
25:53
where those connections are. But I think, you
25:55
know, think about the example of
25:57
Google Maps. Like, there's different levels of detail,
25:59
and of course any video game
26:02
has this too. But, you know, if
26:04
you're taking satellite data and trying
26:06
to stitch together a map, you zoom in,
26:08
you can see the beach, and you
26:10
zoom in, you see the car tracks, and
26:12
you zoom in further, at some point
26:14
you're gonna get to pixels, right? Yeah. And
26:16
you zoom out, and maybe you see
26:18
this landscape of like a beach next to
26:21
the ocean, but then probably you zoom
26:23
out at some level and they've got like
26:25
the name of the beach. Right. So
26:27
there's like a high level of detail. I think
26:29
graphs are much the same. There are
26:31
connections at the low level. Like AskNews
26:33
is saying, like, you
26:35
know, here's reporting from Zimbabwe.
26:38
This is like the reporters on the ground.
26:41
But then you zoom out and you're like,
26:43
okay, well, you know, what impact does
26:45
this have on our supply network? Do
26:47
we have to really make different plans? Is there
26:49
going to be like a war breaking out that
26:51
causes, you know, all those shipping containers to be
26:53
delayed by three months? I
26:56
think at some level you need
26:58
to think of the graphs as
27:00
sort of collecting higher and higher
27:02
into more abstracted, more refined
27:04
concepts, if you will. And
27:06
so the stuff at the low level is kind
27:08
of like, let's see how it all fits together. The
27:11
stuff at a higher level, it's like, oh, actually,
27:13
we can maybe do some inference on this, or we
27:15
can use this to help structure other data that
27:17
we're going to piece together. So,
27:21
Demetrios, you actually touched
27:23
on a really big subject
27:25
that is... Now, in
27:28
the exploratory
27:30
process, it's combined
27:33
with the questions. Knowing what
27:35
question to ask often is
27:37
80-90% of the work. So,
27:40
a prescribed thing to
27:42
give you the answer often
27:44
misses the point, or
27:46
misses the important subtleties. But
27:49
the problem is how
27:51
do you discover the question
27:53
you need to ask?
27:55
And so in the way
27:57
that our perception, our
27:59
visual perception, our brain is
28:01
a fantastic... I don't want to
28:03
call it a machine, or
28:05
I don't want to even call
28:07
it a tool, but has
28:09
this great power of seeing patterns
28:11
in the information. Like
28:13
we look out in the sky,
28:15
we see the cloud, we
28:18
have some... we have some kind
28:20
of, like you are a
28:22
performer, I look at your performance,
28:24
your dance, like the information
28:26
being expressed without being able to
28:28
verbalize it, to define it,
28:30
but you have to watch it
28:32
to feel that. Maybe
28:34
you watch it long enough, you start to be able to
28:36
describe it, you start to be able to say, oh,
28:38
this is, something
28:40
is there. So in a way
28:42
that what the graph does is
28:45
the graph is a fantastic medium
28:47
for visualization. You look
28:49
at the information expressed, just
28:51
like how our brain works. Like
28:53
when we think about you, Demetrios,
28:55
I immediately think about Paco, because
28:57
we were in the same podcast
29:00
room together, so that's association. So
29:02
this association of
29:04
multiple pieces of information entities
29:06
in the space, if you
29:08
visualize effectively, it helps you
29:10
to see the patterns, helps
29:13
you to see all the
29:15
missing links, missing patterns, things
29:17
that get our attention. And
29:19
then we start to be able
29:21
to formulate the question, to
29:24
formulate, to
29:26
answer the question. More
29:30
than a tabular data
29:32
structure, I have to say,
29:34
the graph really helps
29:36
us to engage our brain
29:39
in this way, to
29:41
spot important information. Just go
29:43
watch a dance performance. You
29:46
see something definitive
29:48
happening, but you
29:50
know it before
29:52
you engage your language
29:54
or logical thinking. Afterwards,
29:58
things, concepts start to form,
30:00
and then you can start to
30:02
build things around it. Oh, dude.
30:06
How cool is that? You know
30:08
it before you can express it
30:10
in that way. Absolutely. I
30:12
think a lot of analytics workflows
30:14
work the other way around. We
30:16
focus so much on building up
30:18
the queries, building up
30:20
the programs to
30:23
drive it, to
30:25
drive the answer.
30:29
But as Paco and I,
30:31
in the investigative space,
30:33
we all know that too
30:35
often getting the hint
30:37
is 80% of the work. Like
30:41
if you know that you're being
30:43
attacked, you know that they came in
30:45
through some vector, there's probably some
30:47
set of machines that are compromised. You're
30:50
not seeing that. You're seeing where you
30:52
know, the bad things are happening, stuff
30:54
is being stolen or whatever. So
30:57
looking across your network, just building up a
30:59
graph of like the associations of what's happening
31:01
during an attack, there's some placeholders. There are
31:03
definite questions that could be generated like, which
31:05
machine was compromised? Maybe I should fix that.
31:07
So I think from the operational perspective, you
31:09
know, I mean, you kind of have to
31:11
think of, I mean, we do think about
31:13
that, right? We do think about like, how
31:16
do we identify those unknowns? But the
31:18
problem is that the more complex
31:20
the problem becomes the more that
31:22
those unknowns are not something that
31:24
can really be charted. They have
31:26
to be sort of poked at
31:28
and explored. Yeah, and I think
31:30
that's why, Wei, what you're saying
31:33
with the graph being this visual
31:35
medium that we can poke at
31:37
and we can explore, and it
31:39
gives us a different perspective with
31:41
which we can work with and
31:43
wrestle with the data, is
31:45
something that I hadn't heard before,
31:47
but it makes complete sense. From
31:50
a historical perspective, in terms of
31:52
data, you know, something to bring
31:54
out would be to consider spreadsheets,
31:56
because like spreadsheets are sort of my
31:58
go -to example. This is all in
32:00
tabular form. It's very, very sort of,
32:03
you know, left brain. Everything is
32:05
very buttoned down. But the thing about spreadsheets
32:07
that you never see is there is a really
32:09
complex graph behind it, and it only works
32:11
because of that. But they never
32:13
show that. They just show the tabular
32:15
part. But all the real knowledge and
32:17
dynamics and all the real information you're
32:19
capturing in a spreadsheet is about those different
32:21
dependencies and how that graph functions. Classic.
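Paco's point about the hidden graph inside a spreadsheet can be made concrete with a toy recalculation engine: formula cells depend on other cells, and recomputation is a topological walk of that dependency graph. The cell names and formulas below are made up for illustration; `graphlib` is Python's standard topological-sort module.

```python
from graphlib import TopologicalSorter

# Toy spreadsheet: two literal cells plus two formula cells.
values = {"A1": 10, "A2": 32}
formulas = {
    "B1": (["A1", "A2"], lambda a1, a2: a1 + a2),  # B1 = A1 + A2
    "C1": (["B1"], lambda b1: b1 * 2),             # C1 = B1 * 2
}

# Build the dependency graph: each formula cell points at its inputs.
deps = {cell: set(inputs) for cell, (inputs, _) in formulas.items()}

# Evaluate in dependency order, which is exactly what a spreadsheet
# engine does behind the tabular view it shows you.
for cell in TopologicalSorter(deps).static_order():
    if cell in formulas:
        inputs, fn = formulas[cell]
        values[cell] = fn(*(values[c] for c in inputs))

print(values["C1"])  # → 84
```

Change `A1` and every downstream cell recomputes in the same order; the grid you see is just a projection of this graph.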
32:25
Of course we don't see it, because that
32:27
would be absolute chaos for us. Mind
32:29
blown. The graph is
32:31
this fantastic medium for
32:33
this perceptive thinking. Well,
32:36
the challenge is like, when
32:38
we talk about graph, I think
32:40
that we need to really,
32:42
really, like, separate two things:
32:45
graph as the medium of information
32:47
capture and graph as the
32:49
medium to help us
32:51
think. Those are two different
32:53
things. Graph as information capture, the
32:56
sole purpose is to capture
32:58
information as precisely as possible,
33:00
as completely as possible. You
33:02
want to capture as much
33:04
truth as possible. However,
33:07
graph as a way of thinking,
33:10
if you take the raw
33:12
graph as captured, preserving a
33:14
lot of truth, well, the
33:16
problem is we can only
33:18
hold seven pieces of information in our brain
33:20
at any given moment.
33:23
We'll be overwhelmed by all those
33:25
graphs. If we think about
33:27
our brain, in that
33:29
way, even the vector
33:31
embedding, I call it an implicit
33:33
graph, because vector embedding gave
33:35
you a medium to compute the
33:37
similarity. Effectively, you
33:40
can construct a graph. Exactly.
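What Wei calls an implicit graph can be made explicit with a few lines: connect any two items whose embedding vectors are similar enough. A minimal sketch in Python; the toy embeddings and the 0.8 threshold are invented for illustration, not anything from the episode:

```python
from itertools import combinations

def cosine(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def similarity_graph(embeddings, threshold=0.8):
    """Manifest an explicit graph out of the implicit one: add an edge
    between any two items whose embeddings are similar enough."""
    return [(a, b)
            for (a, ua), (b, ub) in combinations(embeddings.items(), 2)
            if cosine(ua, ub) >= threshold]

emb = {"paco": [1.0, 0.1], "wei": [0.9, 0.2], "coffee": [0.0, 1.0]}
print(similarity_graph(emb))  # [('paco', 'wei')]
```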
33:43
You can manifest a graph
33:45
out of it. So
33:48
you will see that the
33:50
graph being captured at the layer,
33:52
at the stage that's really
33:54
designed to preserve the ground truth,
33:57
as much truth as possible. But
34:00
then you need a way
34:02
to work the data into
34:04
a form that we can
34:06
easily digest with our perceptive
34:08
power. That is a challenge.
34:10
This is also why, in
34:12
my mind, there is a paradox with
34:14
graphs. In theory, people
34:16
know the graph is how we
34:18
think. Thus, it's important.
34:21
But in practice, that is
34:23
a barrier. And how do
34:25
you reconcile the need between
34:27
graphs as information capture medium
34:30
and the graph to support
34:32
our perceptive thinking medium? It's
34:34
a very different thing. just
34:39
going back to what you
34:41
were saying with, we can relate to
34:43
each other because we're on
34:46
this podcast together. We've done stuff
34:48
together. Maybe there's certain things
34:50
that come up in our memories
34:52
that are going to be
34:54
the most pertinent to that graph
34:56
that we have in our
34:58
head, but it's never going to
35:01
expand more than seven hops
35:03
or seven different parts of that
35:05
graph. Have you
35:07
ever worked with, there's
35:09
like a kind of, I guess
35:11
rubric might be a way to
35:14
say it, that came out of Carnegie
35:16
Mellon, out of CMU. Jeannette Wing
35:18
had this idea of what's called
35:20
computational thinking. And so it's
35:22
sort of like a four-step process
35:24
of breaking down a problem
35:26
and then abstracting back out. It's
35:29
really powerful and I've used a
35:31
lot in courses teaching people but I
35:33
think that there may be
35:35
something kind of emerging as
35:37
like graph thinking and so just
35:39
to throw out like a straw
35:41
man here This is kind of
35:43
thinking out loud, but one of
35:46
the things that we see in
35:48
like fin crime in financial investigations
35:50
is a kind of graph thinking
35:52
a four step process repeated over
35:54
and over where you know,
35:56
you do your best to build
35:58
out this graph and it might have hundreds
36:00
of millions of nodes or billions of
36:02
nodes or some ginormous number, something beyond human
36:04
scale, beyond human comprehension. But
36:07
then step two, partition. So
36:10
like, can we break out this
36:12
enormous graph into some areas of
36:14
subgraphs of patterns that are interesting?
36:16
Like, hey, this looks like
36:19
a really good customer, or hey,
36:21
this looks like a money mule,
36:24
you know, fraud scheme. And
36:27
so you go, you do this dimensional
36:29
reduction then because you go from like five
36:31
billion nodes in a graph down to
36:33
maybe 10 or 20 that are interesting. And
36:36
so that's like, there are graph algorithms
36:38
like Louvain or, like, you know, weakly
36:40
connected components, or there are different ways
36:42
to get down to that scale. And
36:45
in like in machine learning in general,
36:47
we're looking at a lot of dimensionality reduction,
36:49
right? So, Once
36:51
you've got down to that scale now
36:53
you can use other graph algorithms like
36:55
maybe between a centrality or different forms
36:57
of centrality to understand how are these
36:59
parts connected and Gosh, maybe there's like
37:01
one node in there who's orchestrating the
37:03
whole crime ring. In a typical case there
37:06
might be like a person with a
37:08
bunch of shell companies, right? And they're
37:10
doing fraud So that's step three is
37:12
like leveraging certain types of graph algorithms
37:14
to, sort of, think of PageRank:
37:16
let's bubble up to the top the
37:18
parts that are probably the first
37:20
good steps to investigate. And
37:23
then step four, put
37:25
it through a work process.
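The four steps Paco describes — build the graph, partition it, rank what's left, then route to a workflow — could be sketched roughly like this. This is a toy stand-in, not his implementation: it uses plain connected components instead of Louvain, degree instead of betweenness or PageRank, and the node names are invented:

```python
from collections import defaultdict

def components(edges):
    """Step two: partition the big graph into connected subgraphs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        parts.append(comp)
    return parts

def most_central(edges, comp):
    """Step three: inside one subgraph, surface the best-connected node
    (degree here is a cheap stand-in for betweenness or PageRank)."""
    deg = defaultdict(int)
    for a, b in edges:
        if a in comp and b in comp:
            deg[a] += 1
            deg[b] += 1
    return max(deg, key=deg.get)

# Step one happened elsewhere: a (tiny) graph was built.
edges = [("mule1", "shell_co"), ("mule2", "shell_co"),
         ("mule3", "shell_co"), ("alice", "bob")]
ring = max(components(edges), key=len)   # the interesting subgraph
suspect = most_central(edges, ring)      # step four: hand this to an analyst
print(suspect)  # shell_co
```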
37:27
And I mean, if you're working with people
37:29
in a bank, put it through case management
37:31
tools, you know, a level
37:33
A analyst gets assigned it, they go
37:36
and they start poking around the graph, they
37:38
do something interactive, they work with the
37:40
visualization, and they apply what they've learned. Or
37:43
you may have some agents involved
37:45
there too to help like summarize
37:47
and and and dig up part,
37:49
but it's a workflow So it's
37:51
kind of a four -step process
37:53
of sort of graph thinking if
37:55
you will that can be applied
37:57
and can integrate people and also
37:59
AI technology together. Yeah, I want
38:02
to add one more thing to what
38:04
Paco said. It's really, really important
38:06
to be able to narrow it
38:08
down, to be able to identify
38:10
things, to reduce, reduce, reduce. But
38:12
there's also another aspect
38:14
which is a simplification
38:17
abstraction. Like very
38:19
often when you capture the data, you
38:21
don't really know the domain, or
38:23
you don't know the future
38:25
question. So the domain is
38:27
wide. But when we look for
38:29
the information, the domain
38:31
is narrowed. When the domain is narrowed,
38:33
for example, like I call Paco
38:35
as a math scientist, at some
38:37
point I can just refer to Paco
38:40
as a math scientist. I don't
38:42
need to add information because the math
38:44
scientist is Paco. And
38:46
that's only valid in the
38:48
specific domain. So
38:50
the reason I say that is
38:52
because with a lot of information,
38:54
when the domain is wide, like
38:57
I call it, when
38:59
you capture information, I
39:01
prefer a pure-edge
39:03
approach. Like in the
39:05
graph, an edge has no
39:07
properties. It's just an edge, it's
39:09
just an association. Anything that needs a
39:11
property, meaning the things you
39:13
may need to amend
39:15
later on, maybe you have
39:17
something pointing to it or pointing
39:19
out of it, you keep
39:21
it as a node. Now, as
39:23
you're thinking very often like I
39:26
know Paco. "I know Paco,"
39:28
this relationship, can carry a
39:30
lot of context in it already.
39:32
I don't need additional information
39:34
to show, to tell how
39:36
I know Paco. It can just
39:38
be in there. "I know Paco"
39:41
itself is sufficient. So
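Wei's pure-edge approach might look like this toy sketch, where the context of how two people know each other is reified as a node rather than stored as edge properties; all the names and labels here are hypothetical:

```python
# Edge-with-properties style packs context into the relationship itself:
rel_with_props = ("wei", "knows", "paco", {"via": "podcast_2024"})

# Pure-edge style: edges are bare associations. Anything you might later
# amend, or point other facts at, is kept as a node instead.
edges = {
    ("wei", "attended", "podcast_2024"),
    ("paco", "attended", "podcast_2024"),
    # new facts attach to the context node; no edge needs rewriting:
    ("podcast_2024", "hosted_by", "demetrios"),
}

def share_context(a, b, edges):
    """'a knows b' distilled from the detailed graph: true if the two
    people point at any common context node."""
    ctx_a = {t for (s, _, t) in edges if s == a}
    ctx_b = {t for (s, _, t) in edges if s == b}
    return bool(ctx_a & ctx_b)

print(share_context("wei", "paco", edges))  # True
```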
39:43
what that means is when
39:45
we present like I know Pako
39:47
that relationship as a single
39:49
relationship, right? In
39:51
the data layer, there might be
39:53
thousands or tens
39:56
of thousands of pieces of information there,
39:58
but it comes out as
40:00
one single piece of concise
40:02
information. I think
40:04
that is where I
40:06
think an analytic workflow
40:08
or visual analytic workflow
40:10
should be, is to
40:12
be able to go
40:14
from a very detailed,
40:16
broad, big, large information,
40:18
distill or aggregate down
40:20
to a simple representation,
40:22
but is grounded in
40:24
that particular domain, in
40:26
that particular context, so
40:29
for us to, so we can
40:31
communicate. We can communicate in
40:33
simple language rather than carry a
40:35
lot of information unless we
40:37
have to. I know
40:39
Paco, that's it. We
40:41
don't need to know how we know each
40:43
other, where do we know each other
40:45
in certain contexts. Is
40:47
it almost like the data
40:49
underneath is like an
40:51
iceberg in a way and
40:53
you knowing Paco is
40:55
like the tip of the
40:57
iceberg. You have that one. Piece
41:00
of information, but then if you
41:02
wanted to get more granular you
41:04
can go down and see the
41:06
whole iceberg. Yes. Wei, could we,
41:09
could we say then that, you
41:11
know, we pull everything. We connect
41:13
everything together. It's very noisy. We
41:15
can go up different levels of
41:17
abstraction. But to your point then,
41:19
we're going up levels of abstraction
41:21
in particular domains, like for purpose.
41:24
So we have some shared definitions.
41:27
And then we can start to
41:29
say, OK, now let's do our
41:31
Louvain partitioning or whatever. Then we
41:33
start to drill down into subgraphs.
41:35
It's like maybe a five -step process.
41:37
Yeah, even with Louvain
41:40
community calculation or any
41:42
centrality calculation, the graph
41:44
has to be simple.
41:46
Because very often, I
41:49
think the graph we
41:51
talk about is I
41:53
call it the multi
41:55
-domain graph. It has
41:57
different type of information
42:00
in one graph. So
42:02
computing a centrality
42:05
in that kind of a
42:07
hypergraph as a hypergraph
42:09
is very challenging, or what
42:11
does it mean as
42:14
a result if you mix
42:16
humans and emails?
42:18
It's difficult. So that
42:20
process itself to me
42:22
is we already need to
42:24
prepare, to transform, our
42:26
graph data into a form
42:28
that is suitable for
42:31
that centrality computation. Very often
42:33
like you have to
42:35
already project into a specific
42:37
domain for that computation
42:39
to happen. Very
42:41
good. That's what
42:43
I was thinking is like the
42:46
data that you have only becomes
42:48
relevant once you've narrowed it down
42:50
in a certain way and you're
42:52
looking at a certain plane of
42:54
that domain and you say, okay,
42:56
now we're going to be focusing
42:59
in on this plane. That's
43:01
when certain nodes
43:04
and certain data and certain
43:06
connections become relevant because
43:08
you're looking at that layer
43:10
almost in my head
43:12
if I visualize it. And
43:14
we're talking about that
43:16
Google Maps example again, you're
43:18
diving deeper and deeper
43:20
and you see different structures
43:23
depending on the layer
43:25
that you're looking at. And
43:29
and this fits very well with
43:31
like data mesh kinds of concepts,
43:33
you know, Zhamak Dehghani talking about
43:36
how different domains share. You have to
43:38
abstract, you have to come up
43:40
with the relations. I think she also
43:42
has the idea of, like, contracts,
43:44
you know where you have relations across
43:46
domains So you share some definitions
43:48
you have to you have to condense
43:50
down to that level before you
43:53
can go across domain so Yeah,
43:55
if we use the domains in
43:57
an organization to kind of guide when
43:59
and where and how do we
44:01
condense down, then we can
44:03
really take advantage of this
44:05
kind of abstraction. But it's
44:07
almost like I realized after
44:09
I said it, there's
44:11
two vectors or there's two
44:14
dimensions that you are
44:16
looking at when you are
44:18
zooming in or zooming
44:20
out because you're playing on
44:22
the field of
44:24
granularity, but you're also playing
44:26
on the field of the domain
44:28
and what is relevant in
44:30
that domain. So if we have
44:33
that X and Y axis,
44:35
you can get more granular inside
44:37
of the domain, but then
44:39
you can also just go on
44:41
the X axis and change
44:43
domains. And so that, like a
44:45
kaleidoscope, when you turn it,
44:47
you see a whole different set
44:50
of relations. Yeah,
44:54
and I mean in an enterprise context
44:56
this gets really bizarre because, you know,
44:58
the people in the domains that
45:00
you depend on may not even know
45:02
that you're out there You know, you
45:04
may be consuming from some log files
45:06
from another application that are like totally
45:08
driving your product So like can we
45:10
have some sort of contract so that
45:12
we know about each other? But
45:15
yeah scooting across the domains.
45:17
That's the that's the key
45:19
challenge to like leveraging these
45:21
kinds of technologies because usually
45:24
You are in a particular domain when
45:26
you're making those decisions, but for most
45:28
applications you have to combine a couple
45:30
domains, right? So it's usually
45:33
like there's something interesting going
45:35
on between like sales and
45:37
procurement or or sales and
45:39
marketing or or you know
45:41
some other business unit So
45:43
usually oftentimes you will have
45:45
to combine and do you
45:47
then try and create two
45:51
different graphs that are connected to each
45:53
other, or is it one larger graph?
45:55
How do you look at it in
45:57
that regard? Federation
45:59
sounds good. I think trying to
46:01
have one ginormous graph is usually...
46:03
weird. And those projects
46:06
usually don't ever end. But
46:08
federating and being able to go across
46:10
domains and say, okay, over there, let me,
46:12
let me send you something. I'd
46:15
like to know what you can,
46:17
what results can you bring back? So
46:19
are you making a prompt in
46:21
Graphrag across a different domain? Are you
46:23
making a query running some algorithm,
46:25
whatever? There's some kind of information transfer,
46:27
but federation. Yeah,
46:30
I can talk
46:32
about a couple of my
46:34
personal experiences. First,
46:38
bringing information to graph is
46:40
a step forward, a
46:42
step up. Because
46:44
information as a tabular format,
46:46
it needs to be confined
46:48
to very specific definitions,
46:48
a pretty narrow domain. Graph
46:53
is, there's one example, I
46:55
look at the US flight
46:57
record. You can download it
46:59
from the Department of Transportation. They
47:01
release it every two weeks. The
47:04
damn thing has 140
47:06
columns, I think. Really,
47:09
really wide. And
47:11
the reason is because the
47:13
flight may get diverted. Whenever
47:16
the flight gets diverted, you
47:18
add about 10, 15 columns
47:20
of information. So then
47:22
you need to capture that the flight
47:25
may be diverted more than once. Twice,
47:28
is that enough? No, three
47:30
times. Three is not enough, no,
47:32
some have four. So they
47:34
actually allow five diversions. But
47:36
if you have six, too
47:39
bad, it cannot exist. So
47:41
that's the limits of
47:43
tabular format in the
47:45
information capture. With
47:47
the graph, it relaxes a lot. Naturally,
47:50
you can have a thousand diversions.
47:52
I don't care. You
47:54
can just, like, the graph
47:56
can keep adding to
47:58
it. So that is really,
48:00
really a big improvement with
48:02
the graph to allow you
48:04
to have a lot more
48:06
flexibility in capturing the information. And
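A toy illustration of the contrast Wei draws between the two capture formats; the real DOT table's column names differ, these are simplified stand-ins:

```python
# Tabular capture caps repetition: the on-time table reserves a block of
# columns per diversion, up to five, so a sixth has nowhere to go.
row = {
    "flight": "UA123",
    "div1_airport": "DEN",
    "div2_airport": "SLC",
    "div3_airport": None,   # unused slots still occupy columns
    "div4_airport": None,
    "div5_airport": None,
}

# Graph capture relaxes that: one edge per diversion, any number of them.
diversions = [
    ("UA123", "DIVERTED_TO", "DEN"),
    ("UA123", "DIVERTED_TO", "SLC"),
]
diversions.append(("UA123", "DIVERTED_TO", "ORD"))  # a third, or a thousandth
print(len(diversions))  # 3
```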
48:09
the other thing is like
48:11
very often in the tabular
48:13
format, it's very difficult to
48:16
check the mismatch. We
48:18
have example of bringing
48:20
two dataset manager from two
48:22
or three different departments
48:24
in the same organizations. Everybody
48:27
knows the other person's data
48:29
has a problem, but
48:31
you can't force other people to
48:33
fix it. But with the
48:35
graph, when you bring things together,
48:37
you immediately see the mismatches. And
48:39
that, so we have one example
48:41
of a company that spent a couple
48:43
years, they could not reconcile the
48:45
data. But once they bring the
48:47
data into the graph, they start
48:49
to see the mismatch in one
48:51
month, they fix the data problem.
48:54
But they start to see the
48:57
mismatch because of the dependencies? Because
49:00
now, let's see,
49:02
you know the records are
49:04
unique, right? But
49:06
then when you link the
49:08
other record together, you need to
49:10
see, oh, this record is actually
49:12
duplicating other systems that they recorded
49:15
differently. Somebody made a mistake there.
49:17
Yeah. We see that a lot for
49:19
entity resolution work. You think like
49:21
a social security number is unique. But
49:24
then you're bringing in data from some other
49:26
sources. And there was an
49:28
application where maybe early on the product manager
49:30
said, yeah, we need to collect this
49:32
social security number. And then later on they
49:34
said, oh no, we can't do that.
49:36
Just put it in a dummy number. And
49:39
so now you've got like this data
49:41
set that has, you know, 5 ,000 instances
49:43
of the same social security number. So once
49:45
you start to put in a graph,
49:47
you're like, wait, isn't that supposed to be
49:49
unique? How come there's like this enormous
49:51
node with like all these things connected to
49:53
it? Something's wrong. So
49:56
it's really also a great way
49:59
to figure out data quality issues.
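A rough sketch of the data-quality check Paco describes: in graph terms, a dummy SSN shared across thousands of accounts becomes one enormous node, so flagging high-degree SSN nodes surfaces the problem. The records below are invented:

```python
from collections import Counter

# Hypothetical account records: a dummy SSN reused across many accounts.
records = [("acct1", "123-45-6789"), ("acct2", "123-45-6789"),
           ("acct3", "123-45-6789"), ("acct4", "999-99-9999")]

def suspicious_ssns(records, max_degree=1):
    """An SSN node linked to more accounts than expected flags either a
    data-quality problem or a deliberate dummy value."""
    degree = Counter(ssn for _, ssn in records)
    return [ssn for ssn, d in degree.items() if d > max_degree]

print(suspicious_ssns(records))  # ['123-45-6789']
```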
50:01
Yeah. Although there's security.
50:04
I mean, going back to what we were
50:06
talking about before. if you are looking
50:08
in financial investigations, if you're looking in sort
50:10
of criminal investigation, okay, maybe
50:12
you've got some open data, like
50:14
here's, you know, sanctioned shell companies
50:16
or whatever. And then
50:19
maybe you've got some private information
50:21
like customers, but maybe you've also got
50:23
some feeds of like, oh yeah,
50:25
here's an active investigation. We're looking at
50:27
these people. But then
50:29
these particular people, they
50:31
have, you know, immunity
50:34
because they're diplomats. So
50:36
like there's all these different levels of
50:38
security and you start to pull it all
50:40
together in a graph. You get a
50:42
very comprehensive view. Maybe not everybody
50:45
can even see that. Like you don't,
50:47
you know, you don't want the police
50:49
officers who are doing parking tickets to
50:51
know that, you know, XYZ diplomat might
50:53
be investigated for a crime. Like that
50:55
information should not go out. So
50:59
where do you draw the line? Because
51:01
the graph really brings it all together.
51:03
But then how do you handle security
51:05
issues? The
51:07
access control with the
51:10
graph is automatically harder than
51:12
the tabular, the traditional
51:14
database. Well, it feels
51:16
like one of these, what
51:18
you were talking about, with
51:21
the ways that you visualize
51:23
it, you can
51:25
almost create different
51:27
access controls on
51:29
the visualizations. So
51:31
I don't know if you've thought through that
51:33
in a way, but is that kind of
51:35
how you go about it? So
51:37
fundamentally, access control needs to be
51:39
in the data management layer. Like
51:43
if the database can
51:45
support access control, you're
51:47
great. We
51:49
actually, however, run into a situation
51:51
that database do not have the
51:53
sufficient access
51:55
control that supports business needs. So
51:57
in that situation, we actually
51:59
have to implement a filter layer
52:01
in the data access. When
52:04
we pull the data from the
52:06
database, depending on
52:08
the roles and teams,
52:10
and we actually
52:12
prohibit certain information from
52:14
being accessed. But
52:16
that's not a fundamental solution.
52:18
Fundamental solution has to be
52:21
in the data management layer.
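A minimal sketch of the stop-gap filter layer Wei describes, assuming made-up role names and sensitivity labels; as he notes, the fundamental solution still belongs in the data-management layer:

```python
# Hypothetical edges tagged with a sensitivity label at capture time.
EDGES = [
    ("officer_A", "investigates", "diplomat_X", "restricted"),
    ("acct1", "pays", "acct2", "public"),
]

# Which labels each role is cleared to see (invented roles).
ROLE_CLEARANCE = {
    "parking_analyst": {"public"},
    "investigator": {"public", "restricted"},
}

def visible_edges(role):
    """Filter the graph per role before it reaches the visualization."""
    allowed = ROLE_CLEARANCE[role]
    return [(s, p, t) for (s, p, t, label) in EDGES if label in allowed]

print(len(visible_edges("parking_analyst")))  # 1
print(len(visible_edges("investigator")))     # 2
```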
52:23
It's a hard problem. In
52:25
previous work, which is
52:27
more like knowledge graphs
52:30
being used for large
52:32
-scale manufacturing, one
52:34
of the things we ran into
52:36
is security access because you take
52:38
procurement data, plus some operations data,
52:40
plus some sales data, put it
52:42
all into a graph. Suddenly,
52:44
you have a picture of how
52:46
the company works. But it's like a
52:48
really confidential picture. It's like maybe
52:50
the board could see this, but nobody
52:52
else in the company should see
52:54
it. So there's a real power
52:56
there, but there's always a risk. And
52:59
how do you manage that is
53:01
a mind -bogglingly difficult problem. I
53:05
read a book
53:07
talking about... the
53:09
certain, like, intelligence communities,
53:11
when they go to other
53:13
countries. In the past,
53:16
you would use, like, falsified
53:18
identities, but today that is
53:20
not a good idea anymore because of
53:22
all the open-source
53:25
intelligence out there. Even if you
53:27
want to withhold
53:29
some information, people can
53:31
stitch together a picture because
53:33
related pieces of
53:36
information sit there, outside,
53:38
and on social media, like,
53:40
maybe there's a picture of
53:42
you with somebody, that you
53:44
did not take the picture,
53:46
did not post it, but
53:48
somebody posts it on Instagram. And
53:50
so all that information out
53:52
there essentially is a
53:54
graph that can link back to
53:57
you, even though you try
53:59
really hard to stay hidden.
54:01
That's the
54:03
fundamental problem in terms of
54:05
privacy, security, where you want
54:07
to control access to information,
54:10
but because you have all
54:12
those connections in the
54:14
graph, that makes it really
54:16
hard. And a corollary
54:18
with that, when I talk
54:20
with people in enterprise who are
54:22
doing large -scale knowledge graph practices, the
54:25
one thing that I keep
54:27
hearing over and over again is
54:29
companies using graphs for market
54:31
intelligence, or maybe sometimes you would
54:33
say competitive intelligence. But
54:36
a lot of this might be
54:38
for sales win -back strategies, trying
54:40
to understand who's the competitor that got our bid
54:42
away from us. How can we go back
54:44
and try to... give
54:46
them a better quote. Oh,
54:48
wow. And so I've heard this
54:50
over and over again. We're like, that's
54:53
one of the first graphs that
54:55
starts making a lot of money is
54:57
like literally doing intelligence inside the
54:59
enterprise. Yeah,
55:01
I was going to go down
55:03
that route of like, let's talk
55:05
about a few other cool use
55:08
cases that you have seen, whether
55:10
it's just graphs, or it
55:12
is graph rag, which is
55:14
a hot term these days, you
55:16
know? I
55:19
mean, you know, it's
55:21
interesting. There's a lot
55:23
of graph database vendors, and they really kind
55:25
of lean heavy on the graph query
55:27
side of how to run this. And that's
55:30
something that's very familiar with people in
55:32
data engineering, data science, you
55:34
know, using a query. But I
55:36
think in the graph space, there
55:38
are other areas that aren't query
55:40
first, like using graph algorithms or
55:42
using There's a whole
55:44
other area of what should be
55:46
called statistical relational learning, but you know,
55:48
you've probably heard of like Bayesian
55:50
nets or causality or different areas over
55:53
there of using graphs. But
55:55
then there's also graph neural networks,
55:57
like how can we train deep learning
55:59
models to like understand patterns and
56:01
try to suggest, hey, I'm
56:03
looking at like all the contracts you
56:06
have with your vendors. And
56:08
I noticed that these three here are missing some
56:10
terms. Do you, you know, is that a
56:12
mistake? So I
56:14
think that, you know, there's,
56:16
there's the queries, there's the algorithms,
56:18
there's the causality kind of,
56:20
you know, that
56:25
area of, there's
56:27
also the graph neural networks. There's
56:29
a few other areas too, but these
56:31
are These are all like different camps
56:34
inside of the graph space. They don't
56:36
always necessarily talk with each other, but
56:38
I think it's really fascinating now that
56:40
we're starting to see more and more
56:42
hybrid integrations of them. Yeah.
56:46
I like to point out
56:48
that, fundamentally, graph and table
56:50
are two sides of the
56:52
same coin. As
56:54
a physicist, we
56:56
look at sound, at music,
56:58
both from the frequency domain, like,
57:00
is it A, C, D, E,
57:02
F, what's the frequency distribution,
57:05
and also look at the
57:07
waveform, like the time domain.
57:09
Like, in some situations you
57:11
want to filter, or you
57:13
want to access it more in
57:15
the frequency domain; sometimes it
57:17
makes more sense in the
57:19
waveform domain. It's the same
57:21
data. Like, a graph essentially
57:23
is a giant matrix. If
57:26
you think about the
57:28
large language model neural network,
57:31
it's a graph, but
57:33
it's a gigantic,
57:35
extremely sparse matrix, which
57:38
is a table, right? And
57:40
in fact, because
57:42
it's such a giant sparse
57:44
matrix, it's causing, today, NVIDIA
57:46
to be really hot, because NVIDIA
57:48
has these GPUs that
57:50
can process those matrices. But
57:52
guess what? My
57:55
brain consumes about 19
57:57
watts of energy. The
57:59
GPUs running a large language
58:01
model consume tens of
58:03
thousands of watts of
58:05
energy to meet similar
58:08
computation needs. And
58:10
that's extremely inefficient. Even
58:12
though the computation unit is
58:14
much smaller than my neurons, you would
58:17
think it's supposed to
58:19
compute at a higher efficiency. That's
58:21
precisely because they're dealing with
58:23
extremely sparse matrices. They're not treating
58:25
the neural network as a
58:27
graph. They're treating the neural network as
58:29
a matrix. And that's fundamentally
58:31
the problem for the power efficiency.
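A back-of-the-envelope illustration of the sparsity point: storing a sparsely connected network as a dense matrix pays for every zero entry, while storing it as a graph pays only per edge. Toy numbers, not a measurement of any real model:

```python
# The same network, stored two ways: a dense n-by-n matrix versus a
# graph (edge list). A ring of n nodes has only n edges, so almost
# every dense entry is a zero you still store and multiply.
n = 1000
edges = [(i, (i + 1) % n) for i in range(n)]  # a ring: n edges total

dense_entries = n * n          # 1,000,000 stored numbers as a matrix
graph_entries = len(edges)     # 1,000 stored edges as a graph
print(dense_entries // graph_entries)  # 1000x overhead in this toy case
```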
58:34
So there are certain models
58:36
that come up that really
58:38
deal with AI as a
58:40
graph, that claim several orders of magnitude of savings
58:43
in energy consumption. So
58:45
in real-world applications, the
58:48
one reason why graph hasn't taken
58:50
off as we all thought for the
58:52
past 20 years, like, oh, graph is going to
58:54
take off, graph is going to take off, but
58:56
no, it did not. The
58:58
fundamental problem is because
59:01
we are so familiar with
59:03
all the tools and
59:05
methodologies, like workflows, well
59:07
established in the tabular-based
59:09
way of thinking. It's
59:11
like the Department of Transportation
59:14
does not release the flight
59:17
data as a graph. They
59:19
release it as a table.
59:21
It's easy to access, we
59:23
have all the toolings
59:25
that are mature. To change that
59:27
is extremely difficult. So
59:29
in a way, I would
59:32
argue that AI is
59:34
almost made for graph,
59:36
because AI suddenly allows
59:38
you to process unstructured information,
59:40
like emails, reports, things
59:42
like podcast transcriptions, like
59:44
videos, into a
59:46
structured form that computers can access.
59:49
But guess what? It
59:51
is a graph that AI
59:53
will convert those data into. So
59:56
now you suddenly have this, some
59:58
people argue, I think it's like
1:00:00
80% of the information existing
1:00:03
in unstructured form. Some people argue
1:00:05
that the percentage is even larger. So
1:00:08
AI suddenly makes,
1:00:10
like, the majority
1:00:12
of the information available
1:00:14
for analytic workflows and
1:00:17
assessment. And
1:00:19
the funny thing is, you need
1:00:21
graph to do that. So
1:00:23
in a way, my
1:00:25
assessment is, because of AI,
1:00:27
because of AI, we're
1:00:30
actually entering the boom,
1:00:32
like, the exponential-growth era
1:00:34
of graph, because of
1:00:36
the availability of the data.
1:00:39
It's like the internet of
1:00:41
things. We've been waiting
1:00:43
for it to happen since
1:00:45
2010 or 2005 whenever
1:00:47
and it's always just around
1:00:49
the corner. But now
1:00:52
it does make sense that if you
1:00:54
have all of this unstructured data and
1:00:56
you have these relations, then that sounds
1:00:58
like a graph to me. Yeah.
1:01:01
And going back to
1:01:04
like 1980s era, hard
1:01:06
AI, you know, whether we're
1:01:08
talking about like A star B star
1:01:10
kind of algorithms or talking about planning systems,
1:01:12
all of these were expressed as graphs. And
1:01:15
like, you know, some of the early
1:01:17
thinking that that was like pre -Google
1:01:19
that led to Google, they were talking
1:01:21
about graphs. Some of that
1:01:23
work actually came out of like groupware,
1:01:25
but based on graphs. So it's there. Funny
1:01:28
you say that because we
1:01:30
had one of the talks at
1:01:32
the AI Quality Conference back
1:01:34
last year. was from the
1:01:36
guy who created Docker, Solomon. And
1:01:39
his whole talk was really like,
1:01:41
everything's a graph. If we
1:01:43
really break it down, it's just, it's
1:01:45
all graphs and how one thing relates
1:01:48
to another thing. I'll throw, I'll
1:01:50
throw something else in to kind of
1:01:52
go back to our early part. We
1:01:54
were talking about East meets West. There's
1:01:56
a book, a real
1:01:59
favorite book of mine, from the early days.
1:02:02
This is like going back to the early
1:02:04
90s, but early days of neural networks. About
1:02:07
this idea of like, yeah, there's
1:02:09
some conventions in the West, maybe we
1:02:11
can back off. It's by
1:02:13
a USC professor called Bart Kosko. It's
1:02:15
called Fuzzy Thinking. And
1:02:17
sort of his critique of
1:02:19
science, but more from a lens
1:02:22
of more Eastern perspectives. I
1:02:25
know that this book is like more than
1:02:27
30 years old, but I think that there's
1:02:29
some really great perspectives there that weigh in
1:02:31
a lot, especially what Wei was saying about
1:02:33
like, where are we now with LLMs and
1:02:35
how we're leveraging this in the context of
1:02:37
graphs. So
1:02:40
I think the other thing,
1:02:42
was there anything else that you
1:02:44
guys wanted to talk about
1:02:46
before we jump? I know
1:02:48
there's a lot of cool data
1:02:50
visualization stuff that you're doing, Wei.
1:02:52
Yeah, I just want to add
1:02:54
one thing. I
1:02:57
just want to say
1:02:59
the visualization is not the
1:03:01
end. The
1:03:03
goal is to support analytics.
1:03:07
So I know everybody when it
1:03:09
comes to the graph, talk
1:03:11
about graph visualizations. But
1:03:13
in my mind, what's
1:03:15
really what we need is visual
1:03:17
analytics. How can
1:03:19
we visually transform the information?
1:03:21
How can we visually go
1:03:23
from like information that was
1:03:26
suited for data management, for
1:03:28
data capture, that was so
1:03:30
you can access, work them
1:03:32
step by step towards information
1:03:34
that's suitable for presentation for
1:03:36
answering the specific questions in
1:03:38
that particular domain. So
1:03:41
those steps require a transformation
1:03:43
of data that is not just
1:03:45
like a filter, but
1:03:47
also, fundamentally, a
1:03:49
graph schema mutation. The
1:03:52
schema you have for the
1:03:54
data capturing is not a schema
1:03:56
suitable for presentation. There are
1:03:58
two different things. If
1:04:01
you think about the big data era, the
1:04:04
development of MapReduce
1:04:06
allowed you to
1:04:08
have this step-by-step
1:04:10
flow of information from
1:04:12
the originally captured tabular
1:04:14
format into a very
1:04:17
different table that you
1:04:19
can present. In
1:04:21
graph, it's the same thing
1:04:23
that what graph analytics needs is
1:04:25
a step-by-step, like we
1:04:27
call it a calculus, or operators, to
1:04:29
transform your data from the
1:04:32
form that's been captured to the
1:04:34
form that you want to
1:04:36
present to answer the question. Now
1:04:39
that calculus, it's,
1:04:42
I think it
1:04:44
needs to be in two
1:04:46
forms. It needs to be
1:04:48
in a form where you
1:04:50
can process data in large
1:04:52
quantities, like a large graph
1:04:54
mutating step by step. But
1:04:57
it also needs to be visual. You
1:04:59
need the same set of, a
1:05:01
parallel set of operators
1:05:03
that a data analyst,
1:05:06
but ideally a domain
1:05:08
expert, not a data person, not
1:05:11
somebody who can write
1:05:13
Python or Cypher queries or
1:05:15
GQL. But somebody
1:05:18
with the domain knowledge, can look at it,
1:05:20
because graph is so visual. You're
1:05:22
like, hey, I want to
1:05:24
simplify this. Oh, I know
1:05:26
Paco and Wei have so
1:05:28
many meeting points. Let's abstract
1:05:30
that out. Let's just create a
1:05:33
single relationship that we infer,
1:05:35
like Wei and Paco, that they
1:05:37
know each other, and get
1:05:39
rid of all the other information. So
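That distillation operator — collapsing the many shared meeting points into a single inferred "knows" edge — might be sketched like this; the relationship labels and data are invented for illustration:

```python
# A detailed multi-domain graph: people linked to shared context nodes.
detailed = [("wei", "attended", "conf_a"), ("paco", "attended", "conf_a"),
            ("wei", "attended", "podcast"), ("paco", "attended", "podcast"),
            ("wei", "likes", "coffee")]

def abstract_knows(edges):
    """Replace shared-context paths with one concise 'knows' edge,
    discarding all the other detail."""
    attendees = {}
    for s, p, t in edges:
        if p == "attended":
            attendees.setdefault(t, set()).add(s)
    knows = set()
    for people in attendees.values():
        for a in sorted(people):
            for b in sorted(people):
                if a < b:
                    knows.add((a, "knows", b))
    return sorted(knows)

print(abstract_knows(detailed))  # [('paco', 'knows', 'wei')]
```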
1:05:42
this might all say, hey, Paco
1:05:44
knows a million people. Maybe I
1:05:46
underestimated Paco a little bit.
1:05:48
So sorry about that. No
1:05:50
kidding. You probably know more than that. But
1:05:52
from the graph, we can quickly compute
1:05:54
this number and put it on Paco,
1:05:56
make Paco very, very big, because Paco knows
1:05:58
a million people. So
1:06:01
that kind of operation is
1:06:03
highly intuitive. So I
1:06:05
want to stress this. The
1:06:07
visualization for graph is not the
1:06:09
end. The visualization for graph
1:06:11
is a tool you use to transform
1:06:13
the graph to get you the answer.
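A minimal sketch, in plain Python with made-up names and edges, of the kind of operator pipeline Wei describes: collapse the many raw "meeting" edges between two people into one abstracted "knows" relationship, then compute each node's connection count so the visualization can size it.

```python
from collections import defaultdict

# Raw capture: one edge per observed meeting (many parallel edges).
# All names and edges here are purely illustrative.
meetings = [
    ("Wei", "Paco"), ("Wei", "Paco"), ("Wei", "Paco"),
    ("Paco", "Demetrios"), ("Wei", "Demetrios"),
]

# Operator 1: abstraction. Collapse parallel edges into a single
# "knows" edge, keeping the meeting count as an edge weight.
knows = defaultdict(int)
for a, b in meetings:
    knows[frozenset((a, b))] += 1

# Operator 2: aggregation. Count each node's connections in the
# simplified graph; the visualization can map this to node size.
degree = defaultdict(int)
for pair in knows:
    for person in pair:
        degree[person] += 1

print(dict(degree))
```

Each step is a small, composable mutation of the graph, in the same spirit as a MapReduce stage over a table.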
1:06:15
That's a waypoint. Very
1:06:17
good. Yeah, that
1:06:20
is very in line with
1:06:22
what you were saying earlier
1:06:24
on how when you don't
1:06:26
know the question, that's sometimes
1:06:28
the hardest part. And so
1:06:30
being able to wrestle with
1:06:33
the data in different forms,
1:06:35
one being visualizing it
1:06:37
in different ways. That's one
1:06:39
tool to hopefully help you
1:06:41
get to the answer or
1:06:43
first step, the question, which
1:06:45
can then lead to the
1:06:47
answer you're looking for. Yeah.
1:06:50
And to mutate the graph
1:06:52
visually. So you
1:06:54
can start poking it.
1:06:57
Yeah. Yeah, exactly.
1:06:59
It does feel
1:07:01
like the ability to
1:07:04
just mutate
1:07:07
the graph is such a
1:07:09
strong tool. Because of
1:07:11
all these different reasons that
1:07:13
we had mentioned when it comes
1:07:15
to the depth and the
1:07:18
way that you're able to look
1:07:20
at the domains or you're
1:07:22
able to just find anomalies or
1:07:24
find different data quality issues,
1:07:26
whatever it may be, whatever your
1:07:28
use case is, it's very
1:07:30
cool. It does sound instinctively
1:07:33
a bit manual though, right? So
1:07:37
far I think Wei has
1:07:39
brilliant examples of what they're doing, like
1:07:41
with SightXR, of leveraging
1:07:43
3D visualizations, zoom in, zoom out,
1:07:45
in conjunction with algorithmic ways,
1:07:47
using graph algorithms to sort of
1:07:50
focus the lens, focus the
1:07:52
searchlight. I think that more
1:07:54
can be automated over time,
1:07:56
and maybe this is where agents
1:07:58
come in, actually helping
1:08:00
determine how to be
1:08:03
the cinematographer there on the
1:08:05
graph. Yeah. So there's definitely a
1:08:07
way of helping you to
1:08:09
look at perspectives. And
1:08:11
very often we deal with the data
1:08:13
that's of a graph-connected nature, but it's
1:08:15
also dimensional. Each
1:08:18
node has so many properties.
1:08:20
Each property is a dimension. So
1:08:22
it's high-dimensional information. So
1:08:25
which dimension set do you want
1:08:27
to take in combination with the
1:08:29
network information to help you to
1:08:31
see? You need to have a
1:08:34
versatile, flexible way of choosing
1:08:36
the dimension set. Very
1:08:38
often, when you shift from
1:08:40
one dimension to another dimension,
1:08:42
you reveal some flocking of things
1:08:44
going together, some clustering that is
1:08:47
happening. It really says, hey, those
1:08:49
things always move in the same
1:08:51
direction. So those signals
1:08:53
help you to formulate a
1:08:55
lot of ideas, instincts from
1:08:57
the data. And then when
1:08:59
you see that information, the next thing you want
1:09:01
to know, hey, I want to capture that as
1:09:03
a feature. Now,
1:09:05
can you represent that as
1:09:07
a feature, so that
1:09:10
what you see becomes
1:09:12
a thing, becomes
1:09:14
an entity in your visualization
1:09:16
that you can put
1:09:18
back in there? That
1:09:20
is visual
1:09:22
analytics. Whoa.
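A small hypothetical sketch of that loop in plain Python: nodes carry many property dimensions; once a chosen dimension pair reveals a grouping, the group label is written back onto each node so it becomes a stored feature. The property names and threshold below are invented for illustration.

```python
# Hypothetical nodes, each with several property dimensions.
nodes = {
    "a": {"spend": 10, "visits": 2,  "region": "EU"},
    "b": {"spend": 12, "visits": 3,  "region": "US"},
    "c": {"spend": 90, "visits": 40, "region": "EU"},
    "d": {"spend": 95, "visits": 42, "region": "US"},
}

# Chosen dimension set: (spend, visits). A simple threshold stands in
# for whatever grouping the analyst spotted visually.
def cluster_label(props):
    return "heavy" if props["spend"] > 50 and props["visits"] > 20 else "light"

# Feed the observation back: the visual grouping becomes a first-class
# feature stored on every node, usable in later tabular or graph steps.
for props in nodes.values():
    props["cluster"] = cluster_label(props)

print({name: p["cluster"] for name, p in nodes.items()})
```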
1:09:26
So capturing it as a feature
1:09:28
and then you can feed it
1:09:30
into the tabular data in a
1:09:32
way. Yes, exactly. Guys,
1:09:35
this is awesome. Is there
1:09:37
anything else that you want to hit on before
1:09:39
we stop? I feel like I've learned a
1:09:41
ton just from talking to you. I knew it
1:09:43
was going to be great conversation. I was
1:09:45
hanging on to my seat this whole time. It's
1:09:47
like, oh my God, I'm learning so much. I learned
1:09:49
a lot too. Yeah. In
1:09:51
terms of cross-domain, I
1:09:54
want to show one funny
1:09:56
example of how difficult cross-domain
1:09:58
is. So in this
1:10:00
example, it's an extreme
1:10:02
cross-domain. So I organized
1:10:05
a kind of tech, arts,
1:10:07
dance, and science, like,
1:10:10
nonprofit. So
1:10:13
one thing we do every
1:10:15
week, every Wednesday, we bring people
1:10:17
in the engineering and science domain
1:10:19
and people in the dance, art,
1:10:21
music domain together, and we
1:10:23
explore something together and have
1:10:25
a conversation. The very first
1:10:27
meeting, when we brought
1:10:29
people together, that happened about
1:10:31
11 years ago. We
1:10:34
had about 20 people sitting
1:10:36
in the room, everybody in a very
1:10:38
vibrant conversation. And then
1:10:40
all of a sudden I realized something:
1:10:42
it is true that
1:10:44
everybody speaks English, but nobody
1:10:47
can understand each other. Because
1:10:51
they're using similar
1:10:54
vocabularies. But because of
1:10:56
domain, just like Paco
1:10:58
talked about earlier in the enterprise
1:11:00
setting, because of the domain
1:11:02
difference, they mean
1:11:04
totally different things. When a
1:11:06
physicist talks about energy, we have
1:11:09
very concrete things that we call
1:11:11
energy. What a
1:11:13
dancer calls energy is
1:11:15
a very different kind
1:11:17
of energy. When
1:11:19
the computer people talk
1:11:22
about Python, we're not
1:11:24
talking about a snake.
1:11:27
But the dancer, when they hear Python, they're like,
1:11:29
why are you bringing a snake to the
1:11:31
conversation? So
1:11:35
I think, just to
1:11:39
echo what Paco said
1:11:41
earlier in the enterprise data
1:11:43
context: that
1:11:45
domain is very,
1:11:48
very important. You need to be aware of
1:11:50
the domain, knowing the limits
1:11:52
of the domain and how to
1:11:54
find a way to cross domains. For
1:11:54
us, it's generally a lot of conversation.
1:11:56
I think it's a human problem.
1:11:58
It's not a technical problem. Technology
1:12:01
can help, but can only do
1:12:03
so much. We
1:12:06
had a conversation on
1:12:08
here a few months ago
1:12:10
with folks who had
1:12:12
created a data analyst agent.
1:12:14
And they said one
1:12:16
of the hardest parts for
1:12:18
the success of this
1:12:21
agent was to first create
1:12:23
a glossary of business
1:12:25
terms so that the agent could
1:12:27
understand, really trying to
1:12:29
nail down these fuzzy words
1:12:31
and these words that
1:12:33
maybe for one person they
1:12:36
mean one thing and for
1:12:38
another person they mean another
1:12:40
thing and the quintessential
1:12:42
example of this is an
1:12:44
MQL. When you're at one
1:12:46
company, or when
1:12:48
you're on one team, an
1:12:50
MQL is one thing, and
1:12:52
when you go to another
1:12:54
team, an MQL is another
1:12:57
thing. They all mean marketing
1:12:59
qualified lead. But when does
1:13:01
that person become a marketing
1:13:03
qualified lead? What do they
1:13:05
have to have done or
1:13:07
what stage are they in?
1:13:09
And so the agents may
1:13:11
understand, and the LLMs understand,
1:13:13
what an MQL is, kind
1:13:15
of, but you really have
1:13:17
to flesh out this glossary
1:13:19
to let them know all
1:13:22
of these different terms that
1:13:24
you use and that are
1:13:26
in your database. So
1:13:28
when the agent needs to go
1:13:30
and pull, how many MQLs did
1:13:32
we have last week? It
1:13:34
understands what that means. Yeah,
1:13:37
that's your semantic layer right
1:13:39
there. That's a controlled
1:13:41
vocabulary. You put enough of
1:13:43
these together and you get your ontology.
1:13:46
Yeah, yeah, yeah, exactly.
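One way to picture such a glossary, sketched in plain Python with invented field names and definitions: each fuzzy business term maps to a concrete predicate, so an agent answering "how many MQLs did we have?" resolves the term the way this particular team defines it.

```python
# Controlled vocabulary: term -> concrete predicate over a record.
# The definition below (and the lead records) are illustrative
# assumptions, not a real team's definition of an MQL.
glossary = {
    # On this hypothetical team, an MQL is any lead who
    # downloaded a whitepaper.
    "MQL": lambda lead: lead.get("downloaded_whitepaper", False),
}

leads = [
    {"name": "lead-1", "downloaded_whitepaper": True},
    {"name": "lead-2", "downloaded_whitepaper": False},
]

def count_term(term, records):
    """Answer 'how many <term>s did we have?' via the glossary."""
    predicate = glossary[term]
    return sum(1 for r in records if predicate(r))

print(count_term("MQL", leads))  # 1 under this team's definition
```

Another team could swap in a different predicate for "MQL" without changing the agent's query logic, which is the semantic-layer idea Paco points at.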