Kusto: Seasonality and Holidays

I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.

One of the challenges I face is handling seasonality and outliers. For example, large numbers of Microsoft employees take vacation three weeks every year: Thanksgiving week, Christmas and New Year’s.

I analyze what thousands of developers do and those weeks always have low activity, so I have to figure out how to gracefully handle that seasonality.

In this video, I’m going to show you how I used two built-in features of Kusto, startofweek and range, to develop a little function that finds those holiday weeks no matter what year we’re looking at, and that makes it easier for those engineering managers to do the same.

Here’s a visualization of a certain type of developer activity related to bringing new features to production. Those dips are the weekends. Yeah, sometimes we work weekends, but Microsoft prides itself on work/life balance, and so the bulk of activity happens Monday through Friday.

Step one is to group the data to match the activity I’m measuring: by week, not by day. The startofweek function does this nicely, and… while it is a simple function, it’s also really powerful because of that simplicity. There’s no need to decide whether Sunday or Monday is the first day of the week, and those engineering managers can pick it up in seconds and use it without being reminded. It puts us all on the same page.
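
A minimal sketch of that weekly grouping, assuming a hypothetical DeveloperActivity table with Timestamp and Events columns (these names and values are made up for illustration, not the real data behind the chart):

```kusto
// Hypothetical sample data: one row per day of activity.
let DeveloperActivity = datatable(Timestamp:datetime, Events:long)
[
    datetime(2021-11-15), 120,
    datetime(2021-11-16), 135,
    datetime(2021-11-20), 4,
    datetime(2021-11-22), 60,
    datetime(2021-11-25), 2
];
DeveloperActivity
// startofweek() returns the Sunday that begins each timestamp's week.
| summarize Events = sum(Events) by Week = startofweek(Timestamp)
| order by Week asc
```

With this sample, the first three rows roll up into the week starting Sunday, November 14, and the last two into the week starting November 21.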

Startofweek smooths things out quite nicely, doesn’t it? But now you can see my problem, those six sharp downward spikes: that’s U.S. Thanksgiving, Christmas and New Year’s. Whether I exclude them or replace the values with an average, I need to identify them dynamically.

But I can’t just exclude the 24th and 25th of December, for example: I have to exclude the entire work week, and each year, those holidays either fall on different days of the week, or on a particular Thursday in November.

Let’s tackle Christmas and New Year’s first:

Step 1: Create a little data table with numbers for month and day.

Step 2: we use the range function, which lets you create a list of numbers or dates in series. It’s created as a dynamic array, basically a blob of structured text, in a new column.

Step 3: we need to explode that blob of text out so that I have a row for each date, like it was a cross join. For that, we use the mv-expand operator.

Step 4 is the easy part. I restrict the rowset to only dates where the month and day match my list of holidays, and then…

Step 5: use startofweek again.

This pretty much works. Except when Christmas and New Year’s fall on a weekend or on Monday. Look at 2016 for example: both these holidays fall on the first day of the week, meaning that New Year’s week ends on January 7th. That was a regular work and school week. To fix that, all I need to do is switch my holidays to Christmas Eve and New Year’s Eve.
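
A quick way to see the problem and the fix is to run startofweek on both the holidays and their eves for the 2016 season. This is just a sanity check, not part of the function:

```kusto
print ChristmasDay = startofweek(datetime(2016-12-25)),  // 2016-12-25: Christmas was a Sunday
      ChristmasEve = startofweek(datetime(2016-12-24)),  // 2016-12-18: the work week before
      NewYearsDay  = startofweek(datetime(2017-01-01)),  // 2017-01-01: also a Sunday
      NewYearsEve  = startofweek(datetime(2016-12-31))   // 2016-12-25: the week people took off
```

Keying on the eves moves both holiday weeks back by one whenever the holiday itself lands on a Sunday.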

What about Thanksgiving? In the U.S., that’s the fourth Thursday of November.

Step 1: Again, I start with a little datatable, but in this case, instead of the numbered day of the month, I need the week for that month, and the day of the week. Kusto uses a timespan of 4 days to represent Thursday, rather than an integer.

Step 2: Again, use the range function to generate a set of dates and…

Step 3: Use the mv-expand operator to explode this out in a cross join.

Step 4 is where things change from the previous example. I grab only Thursdays from the month in question, November.

Step 5: I order the data. This is critical: Kusto won’t order things for you. You might think the range function would land things in order, but it may not.

Step 6: use the row_number function so that you know which is the fourth Thursday in November.

Step 7: Use startofweek to find the Sunday before Thanksgiving.
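
Here is a stripped-down sketch of steps 4 through 7 on their own. It assumes a two-year window picked arbitrarily for illustration, and it uses the range operator to generate one row per day rather than the range function plus mv-expand:

```kusto
// The 2020-2021 window is an arbitrary choice for illustration.
range Date from datetime(2020-11-01) to datetime(2021-11-30) step 1d
// Keep only Thursdays in November; dayofweek() returns a timespan, so Thursday is 4d.
| where datetime_part('Month', Date) == 11 and dayofweek(Date) == 4d
| extend Year = datetime_part('Year', Date)
// Sorting is required before row_number() can be used.
| order by Year asc, Date asc
// Restart the count at 1 whenever the year changes.
| extend RowNum = row_number(1, prev(Year) != Year)
| where RowNum == 4
| project Thanksgiving = Date, HolidayWeek = startofweek(Date)
```

This returns November 26, 2020 and November 25, 2021, with holiday weeks starting on the Sundays of November 22, 2020 and November 21, 2021.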

That’s it. Now, all I need to do is train my engineering execs to use startofweek, and then do a left anti-join to remove data from those weeks.

Here’s the full code for the function:

.create-or-alter function with (folder = @'') SeasonalityWeeks
(
    rangeStart:datetime = datetime("2016-01-01")
    ,rangeEnd:datetime = datetime("2022-01-01")
)
{
    // Swap the bounds if the caller passed them in reverse order.
    let _rangeStart = iif(rangeStart > rangeEnd,rangeEnd,rangeStart);
    let _rangeEnd = iif(rangeEnd < rangeStart,rangeStart,rangeEnd);
    // Fixed-date holidays, keyed on the eve so that a holiday falling
    // on a Sunday still maps to the week people actually took off.
    let _majorFixedHolidays =
        datatable(Month:int,Day:int,Name:string)
        [
            12,24,"Christmas",
            12,31,"New Year",
        ];
    // Holidays defined as the Nth given weekday of a month.
    // dayofweek() returns a timespan, so Thursday is 4d.
    let _majorVariableHolidays =
        datatable(Month:int,DayOfWeek:timespan ,WeekOfMonth:int,Name:string)
        [
            11,timespan(4d),4,"US Thanksgiving",
        ];
    _majorVariableHolidays
    // One row per holiday per day in the range.
    | extend Date = range(_rangeStart,_rangeEnd,1d)
    | mv-expand Date to typeof(datetime)
    | where Month == datetime_part('Month',Date)
    | extend Weekday = dayofweek(Date)
        ,Year = datetime_part('Year',Date)
    | where Weekday == DayOfWeek
    // Sort first: row_number() needs ordered input.
    | order by Year asc, Date asc
    // Number the matching weekdays, restarting each year.
    | extend RowNum = row_number(1,prev(Year) != Year)
    | where RowNum == WeekOfMonth
    | project HolidayWeek = startofweek(Date)
        , Name
    | union kind = outer
    (
        _majorFixedHolidays
        | extend Date = range(_rangeStart,_rangeEnd,1d)
        | mv-expand Date to typeof(datetime)
        | where Month == datetime_part('Month',Date)
            and Day == datetime_part('Day',Date)
        | project HolidayWeek = startofweek(Date)
            , Name
    )
    // Return each holiday week as a Sunday-through-Saturday span.
    | project Name
        , HolidayWeekStart = HolidayWeek
        , HolidayWeekEnd = datetime_add('Day',6,HolidayWeek)
}
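
Once the function exists in your database, the left anti-join mentioned earlier might look something like this. The DeveloperActivity table and its columns are hypothetical stand-ins; only SeasonalityWeeks and its output columns come from the function above:

```kusto
// Hypothetical daily activity data; substitute your own table.
let DeveloperActivity = datatable(Timestamp:datetime, Events:long)
[
    datetime(2021-11-15), 120,
    datetime(2021-11-22), 60,
    datetime(2021-11-29), 118
];
DeveloperActivity
| summarize Events = sum(Events) by Week = startofweek(Timestamp)
// leftanti keeps only the weeks that have no match in the holiday list.
| join kind=leftanti (
    SeasonalityWeeks(datetime(2021-01-01), datetime(2022-01-01))
    | project Week = HolidayWeekStart
  ) on Week
| order by Week asc
```

With this sample, the week starting November 21, 2021 (Thanksgiving week) should drop out, leaving the weeks on either side.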

Improving Ancestry.com’s MyTreeTags feature

I’ve been so focused on DNA ThruLines and the hints system that I didn’t notice Ancestry.com’s new tagging feature. Tags have been around a long time, and it’s nice that Ancestry.com added this capability.

But… it seems a half-baked effort.

  1. There’s no obvious warning to other researchers when I flag something as unverified or a hypothesis.
  2. Ancestry isn’t helping me ignore the “old” method of using icons.
  3. Tags have no visual impact in tree view: for example, the “no children” tag doesn’t replace my “no children” gender neutral child on my tree.
  4. The “direct line” tag isn’t well thought out. Why can’t I just click myself and activate this along my direct line?

My biggest beef is about the research tags.

On a personal level, I’d love the research tags to appear in more places. One example: I complain a lot about ancestry.com’s poor hint quality, and I will go to “All Hints” page and ignore hundreds of hints for people I don’t care about. But I don’t always remember which profiles I stopped looking at because they’re brick walls. The research tags should appear next to the profile name here to remind me.

But the biggest miss for research tags is communication to other genealogists. One of my biggest fears is that someone else will take a wild guess of mine and copy it. I have one hypothesis in my tree by the name of “Wild Speculation Chew.” And then after I discover my wild guess was wrong and remove it, someone else will copy the copy of my wild guess. And ten years later, there are dozens of trees with my random guess.

Someone actually contacted me and made a joke about the crazy names they gave people back then.


My second complaint is centered around how this feature feels like Ancestry.com is attempting to standardize all the crazy little hacks we all make to help track our research. But ancestry isn’t making an effort to help us ignore those little hacks now that there’s a better option.

For example, I put question marks as the suffix of a person’s name when I’m not convinced I have the relationship right, and create a gender-unknown child named “No Children” when a person didn’t have any kids.

Other people add little icons of angels and immigrant ships. I hate little icons. Well, no, I hate that ancestry serves up those icons as hints. I hate that so much I have a whole video about how easy it would be for ancestry.com to use artificial intelligence to categorize images, and give me the choice of suppressing hints for the image categories I don’t want to see.

My third suggested area of improvements is visualizations for the tags. The central experience in ancestry.com, for me at least, is the tree view. Sometimes I start by searching for an individual, but at some point, I traverse my tree in tree view.

Take a look at Thomas Kirk Plummer, here. I put some tags, but at a glance, all I can see is my ?. To see those tags, I have to click on his name and then expand the tags section.

OK, that’s not too bad. But for standard, un-customizable tags, why not create an additional visualization that is immediately visible for a handful of tags?

For example, research status tags could appear on the right side of the profile pic, with unverified as a question mark, verified a check mark, hypothesis a light bulb, actively researching a magnifying glass, etc.

No children could have a small stop sign at the bottom. A brick wall could have… well, a little brick wall across the top.

My fourth area for improvement is about the “relationships” bucket of tags, specifically the “Direct Line” tag. That is just screaming out to me as a place for improvement.

On my wife’s tree, for example, I can trace back to fifth-great grandparents on almost every branch. That’s 254 people to tag as “Direct Line” and at five clicks per person, that’s 1,270 clicks, just for my wife. And I have several other lines of ancestry, including my own. Really, I’m never going to use that tag. Too much work, too little value.

But what if I could click my wife’s profile and choose an outline color for her direct line ancestry? Two to three clicks, and this could turn into this.

Oh, and why colors? Because there’s a point in my wife’s family tree, ten generations back, where she intersects with my sister-in-law’s family tree. In that case, the square around their common tenth-great-grandparents could show both colors. And I did not realize they were distant cousins for months.

Why your Scotch-Irish ancestors moved so frequently

Do you have ancestors who moved frequently but not far? Say, showing up in 1790s Shelby County, Kentucky, then Bullitt County in 1800, then Grayson County in 1810? Or perhaps Hamilton County, Ohio in the late 1790s then Montgomery County in 1803 and finally Darke County in 1810?

There are two factual scenarios at play here:

First, your ancestors stayed in place but the map changed: that’s what happened in my Ohio example. I covered this in a previous video, check it out.

Second, your ancestors really did move a lot. But why did that family move so frequently when another family in your tree stayed put for decades?

I want to thank Karla York for suggesting this as a topic for a video. She was responding to a comment where I noted that ethnic German immigrants to the United States practiced a crop rotation strategy which kept their land productive and fertile, while Scotch-Irish backcountry pioneers would farm a patch of land for a few years until it was deprived of nitrogen, and then move on to the next.

To be honest, that story is something my mother has told me for years, not something I had researched. Turns out it’s true, but it was just one factor in why some of your ancestors made lots of little moves.

What really drove this, I think, was culture, specifically Scotch-Irish culture, and specifically in the geographical region dubbed Greater Appalachia where the Scotch-Irish settled.

By culture I mean how people lived their lives, from marriage and sex, to how you built your house, to what you cooked. It’s the stuff you learn from your parents and your community about how to survive.

My favorite author on colonial culture, David Hackett Fischer, summarized Scotch-Irish culture in my favorite book on colonial culture, Albion’s Seed, this way:

The [Scotch-Irish] were a restless people who carried their migratory ways from Britain to America… The history of these people was a long series of removals—from England to Scotland, from Scotland to Ireland, from Ireland to Pennsylvania, from Pennsylvania to Carolina…

Fischer cites the example of the village of Fintray: between 1696 and 1701, three-quarters of the population turned over. The same pattern showed up in Appalachian Virginia, where 80% of the people living in Lunenburg County in 1750 were gone by 1769, with half of that movement occurring between 1764 and 1769. Fischer asserts that “these rates of movement were exceptional by eighteenth-century standards.”

Those migrations, in both the borderlands between England and Scotland, and in the colonial backcountry, were short-distance, “as families search for slightly better living conditions. Frequent removals were encouraged by low levels of property-owning.”

A folk-saying from the southern highlands gives you a better idea of how people felt. “When I get ready to move, I just shut the door, call the dogs and start.”

That feels pretty extraordinary. What will you eat? How could you just walk away from your labor investment in crops? What about your tools, your plow?

The answer is culture once again. The Scotch-Irish weren’t farmers the way we might think of colonial farming, with acre after acre of corn and wheat. They combined livestock herding with vegetable gardens and some grain. And they didn’t have a lot of tools: Fischer cites an early 1700s primary source that colonial backcountry Scotch-Irish had “one axe, one broad hoe and one narrow hoe.”

When you picked up and moved, you packed up some produce, a few tools, and then herded your livestock a few miles to a new spot. In Scotland, it was sheep, in the colonial backcountry, pigs or cattle.

Of course, it wasn’t quite so unplanned as it sounds. In The Monongalia Story, a history of one region of West Virginia, Earl Core wrote that:

“A small group of men might come in winter or early spring, build their first cabins, clear and fence their little fields, plant potatoes, corn, beans and pumpkins. After the crops were well started… the men would ride their horses back [to their family’s current residence], again load them with [the rest of their possessions], and return with their family.”

The collaborative nature of this migration shouldn’t be discounted. American culture lionizes the rugged individualist pushing back the frontier, but that was a myth. Frontier migrations were a community affair, and the greater the distance, or the deeper into the territory of another culture that would try to repel what to them was an invasion, the more critical it was to band together.

Fischer notes that the first settlements in Tennessee and Kentucky were centered around military-like forts and stations, where settlers living nearby could retreat for mutual defense.

Core noted that the forts were also the center of the community, where “young couples danced and courted, where marriages were performed and funerals held, where land claims were recorded and justice meted out.”

As the native populations were pushed out & settler control secured, the Scotch-Irish spread out. As one North Carolina congressman put it, “no man ought to live so near another as to hear his neighbor’s dog bark.”

There’s only so much I can pack into a video of less than five minutes. If you want to learn more, get a copy of Albion’s Seed. It’s dense and long, but I think it’s worth it.

So… what of the bit about the Scotch-Irish moving because they wore out the land? It’s a bit of a chicken-and-egg scenario, isn’t it? If your culture is to move frequently, you don’t need to maintain the fertility of your land.

The Scotch-Irish did have a way to re-fertilize land, however. Fischer quoted a traveler to the southern backcountry who noted “A fresh piece of ground… will not bear tobacco past two or three years unless cow-penned; for they manure their ground by keeping their cattle… within hurdles, which they remove when they have sufficiently dunged one spot.”

Why did you accept that hint? A new ancestry.com feature.

Will you look at this? Ancestry.com is asking me why I accepted a hint. Or ignored it. Or said “maybe.”

I have the “beta features” flag turned on for ancestry.com. Despite working in the tech sector, I’m not really an early adopter, but I’ve been so frustrated with ancestry’s service (and with all my brick walls) that I figured it was worth getting to the bleeding edge.

I haven’t seen an announcement for this, but this feels huge to me.

My gut is that ancestry.com is evaluating its hint model—which is at least partially driven by one user adding a record to a profile that matches one of the profiles in my tree. That model assumes that all user input is accurate, when we all know that’s not true.

The short-hand for folks like me in the data analytics world is the phrase “garbage in, garbage out.” If your dataset is garbage, your analysis will be garbage.

Asking questions such as these might just provide ancestry.com a data source to evaluate user contributions, possibly even use machine learning to assess the validity of hints.

For example, clicking that “I want to save and review later” is an easy indicator for ancestry’s algorithm to say “meh, don’t pay attention to this.”

Not selecting anything at all—which I suspect most ancestry.com users would do—would effectively provide the model the same answer: Don’t pay attention to this user’s input.

Response rate could also give ancestry.com a way to score their users: those that are committed to helping ancestry.com understand their data could potentially be given a higher weighting in a more modern hint algorithm. More important, it could help identify careless researchers, and limit their ability to muddy the waters.

Of course, this begs the question of whether ancestry.com should change their baseline assumption: that the central task in genealogy research is finding someone who’s already researched it.

I’m more intrigued by the behavioral side of it, though: by asking these questions, will users reconsider accepting an ancestry.com hint? Will people ask themselves “Is it just a name? Do the dates really match? Did I check the other people in the record?”

There are drawbacks, of course. Take this photo hint for George Rautzhong. It’s a photo of his tombstone, and I have chosen not to attach these to profiles.

I have two main reasons why I won’t attach a piece of evidence. First is that I don’t like the source—it might be a tertiary source, or just a picture I don’t care about. Second, I just don’t care about the profile: I mean, what do I gain from adding city directory entries for a sibling of my main line? Not much.

Ancestry.com has made an erroneous assumption that people will care equally about all the profiles in their tree.

I’ve been, at best, skeptical of the utility of ancestry.com’s features over the past year. But this tells me that great things could be on the way.

Genealogy records & U.S. Immigration Laws

I’ve often wondered why immigration records in the U.S. suddenly change. My great-grandfather’s record from 1904 was so detailed it listed his aunt’s name and address in Philadelphia. The 1846 ship-list that I think was for my 3rd-great-grandfather just listed name, gender and age. And when my 2nd-great-grandparents moved from Quebec to Minnesota in the 1870s, there was no record at all.

On my wife’s side, I can find lists of her Pennsylvania Dutch ancestors when they and their fellow adult male passengers took an oath of loyalty in Philadelphia, but her Scots-Irish ancestors seem to appear magically in Virginia with no records at all.

Why? I did some research, and it turns out, there are three major periods in immigration law in the United States.

After 1882, you start seeing detailed immigration records that are consistent across the country.

Between 1819 and 1882, you’ll have ship manifests with very basic information, but nothing for land crossings.

And before 1819, it’s entirely up to the officials at the port of entry.

In 1891, Congress established a Bureau of Immigration, finishing a transfer of immigration control from states and cities to the federal government which began in 1882. Ellis Island was part of this legal period, and from 1882 on, you’ll find standardized, detailed immigration records regardless of the port of entry.

If you use ancestry.com, you’ll find that only the basics have been indexed—name, gender, etc. What hasn’t been indexed is gold, though: each entry notes where and with whom the immigrant will be staying. This is family: see this entry for my great-grandfather, Michael Devaney? He reported he would be staying with his aunt, and she was unknown to everyone, in the U.S. and Ireland, researching that part of the Devaney family.

Prior to that, the important law was the Steerage Act of 1819, which required that ship captains provide a registry of passengers with name, sex, age, and occupation. You might get a bit more than those data points depending on the ship or port, but don’t get your hopes up. What I found for my third great-grandfather George Haggerty, who came to the U.S. in 1848, is pretty typical.

There are three interesting things to note here: first, ship captains could face legal penalties if the lists were inaccurate, so you can expect some degree of validation.

Second, the requirement was only for ship captains debarking passengers at a port of entry. If, for example, you landed at St. John, took a ferry down the St. Lawrence to Montreal, and then walked south to Manhattan, there wouldn’t be any record of your arrival in the United States. Likewise, a captain could drop some or all of his passengers off on an isolated beach on Cape May before sailing up the Delaware to Philadelphia.

Now, both of those scenarios may sound ridiculous, but thousands of Irish did this during the Famine. Why? Well, Congress charged ship captains a head tax for Irish passengers during the Famine. The English, on the other hand, wanted settlers in Canada, and subsidized the fare to St. John, including the free ferry trip to Montreal.

Before 1819, though, it was entirely up to local governments, and outside of naturalization, most of them did nothing.

Tracking the immigration of English Quakers to Colonial Pennsylvania, for example, is probably going to be through religious sources, not secular. Quakers would carry certificates of removal from their Monthly Meeting in England so they could easily join a new Monthly Meeting in Pennsylvania, and those certificates would be mentioned in those meeting records. But the colonial government didn’t care.

The Puritans also kept detailed records of the Great Migration to Massachusetts, but this wasn’t official government record keeping either.

In Virginia, you’ll find next to nothing.

The only exception was naturalization during the colonial period: under English law, any real property owned by an alien would revert to the Crown upon that alien’s death, not to his heirs.

This is why you can find basic information about ethnic German immigrants to the colonies of Pennsylvania and New York: they were eager to swear loyalty to the Crown so that they could pass any farmland they acquired to their children. They weren’t ship lists, they were oaths of loyalty to the Crown given shortly after arriving in Philadelphia.

Disabling ancestry.com’s hints doesn’t actually disable ancestry.com’s hints

Several months ago, I got so fed up with Ancestry.com’s hint system that I did this. Yup, I disabled all hints on all of my trees.

No more little wiggly leaves showing up on profiles distracting me with icons of little angels or records dated decades before my ancestor was born, or after they died.

Did you see that? In the upper left corner? No?

How about now? See those little leaves pop up?

Yeah. So the only thing that ancestry.com’s “hint disabling” feature does is prevent the little leaf from appearing in the top right-hand corner here.

And in those intervening months, Ancestry.com served me up over a thousand of what I have to imagine are useless hints.

Now, I’ll be honest, I’m the one who is incapable of ignoring all of those hints. I know intellectually that maybe only one or two of those will tell me something new, and so I shouldn’t bother to look. I know that I’m the one that can’t control myself. I admit that.

But I also know that ancestry.com has a data analytics team, and I bet you that analytics team has shown that hints drive user engagement, and that continued user engagement is highly correlated with, and possibly causal of, subscription renewals. So ancestry.com knows that if it keeps showing me hints, I’ll have a reason to keep coming back and paying them money without them having to improve their service much.

I do data analytics for a living for a big tech company. This is the kind of thing we get paid to do.

So… I’m going to start looking at ancestry.com competitors. See what else is out there, see if I can find a service that doesn’t distract me, that helps me focus on the genealogy research I want to focus on.

Not that I’ll be able to drop ancestry.com: the service is designed to be sticky, to make it difficult to switch to a competitor because you can’t move your data easily.

It’s a design model that’s falling out of favor in the commercial space because companies hate getting locked-in. But there’s little to stop it in the consumer space.

Ideas to improve Ancestry.com hints

Ancestry.com hints have become completely useless for me, and I have some ideas on how to improve them.

When I started using Ancestry.com nearly a decade ago, the hint system was amazing, helping me turn some basic information from my wife’s aunts into a fleshed-out family tree.

But today, I would estimate that a third of the hints are for tertiary sources that I rarely use, another third are wildly and obviously wrong, and the remaining third are random images or copy/pastes from sites such as find-a-grave.

Moreover, nine of ten hints are for siblings of direct lines (and those siblings’ spouses) that won’t help me get through brick walls and that I no longer care about.

In short, maybe one in five hundred hints tells me something new and useful about someone I care about.

Consider Charles Stanford. I know this fellow’s story: my father officiated Charles’ 1968 wedding to Jean O’Neill, my dad’s cousin. My father also delivered Charles’ and Jean’s eulogy after they were murdered by a drunk driver in 1987.

Now look at the hints ancestry.com recently offered about a family story I know.

This one hints that Charles married in 1919, decades before he was born. This one hints that he served in the Marines at the age of two. This one hints he died in Wales even though my profile clearly states he died in Pennsylvania.

Just about every day, Ancestry.com serves up useless hints like these. It’s like panning for gold, washing out pounds of mud in the hopes of finding a dust mote of gold.

Is this mess solvable? Yes. Here’s how.

  1. Set some basic data rules. If the profile I create sets a birth & death year & place, don’t show me hints for other states and countries, let alone for records before that person was born, or after they died.
  2. Even better, let me tune hint accuracy just like I can tune when running searches. At a minimum, let me do this at the tree level, but I’d love to be able to adjust person by person.
  3. Let me turn off hints from particular sources. I don’t want to see recommendations for summary genealogies such as North American Family Histories or Sons of the American Revolution. Ever.
  4. Let me turn off hints person by person, even branch by branch. Most of the people in my tree are siblings and their spouses. Once I land the lineage and story for direct ancestors, I’m not going to learn anything else from siblings, cousins and their spouses. I don’t want to delete those people, but I don’t want to have my research distracted by hints about them.

But Ancestry.com should be able to go beyond those simple UI features. Many companies, including Microsoft, offer powerful, easy-to-configure artificial intelligence services that could really make ancestry.com hints useful.

  1. Help me with images. Image classification services, including facial recognition ones, could easily be trained to let me exclude hints for icons of immigrant ships, DNA, angels, country flags, gravestones and other random stuff that I don’t care about. And by easily, I mean point-and-click artificial intelligence. Click the card above to see a quick demo of this with Azure’s Custom Vision AI service.
  2. Identify unique story contributions. Ancestry owns Find-a-grave, it can search the web, and it can use commercially available machine-learning services to exclude sources that were just copied from another site, and highlight sources that are unique and specially transcribed.
  3. Rank the hints, and funnel unlikely hints directly to the “undecided” bucket. Or only notify me in the general user interface if a hint is likely to be a match, while hiding the unlikely hints within the profile.

Of course, I don’t expect to see such improvements. To continue growing, Ancestry.com needs to expand their user base, and that means creating new genealogists. Developing features for users like me who will keep paying regardless doesn’t make much business sense.

Unless Ancestry.com realized that I would pay more for this.

Book Review: American Nations

I just finished a 2011 book by Colin Woodard called American Nations. The book is largely focused on trying to understand how North American politics works today based on emigration and settlements of different pre-colonial cultural & religious groups.

That history has a real application to genealogy, at least, the part where we try to understand who our ancestors were, what they believed and how they lived their lives.

Quick summary:

  1. It doesn’t matter where people came from, they become acculturated to the environment in which they live their lives.
  2. People didn’t just migrate to places in the United States where friends and family already lived. They migrated West within their cultural groups.
  3. Woodard has this really cool map of settlement patterns that can help you understand your ancestors’ culture, and
  4. That map can help you make educated guesses about where to look for records as you move back in time.

It’s these last two that I find fascinating and that I think make Woodard’s book a good purchase. If you want to avoid the modern politics, just skip the last four or five chapters on modern cultural clashes.

Woodard posits that North America has eleven distinct cultural groups or nations, with nine of them pre-dating the Revolution.

I won’t go through them all—read the book, but in my wife’s family, there were really just four:

  1. Tidewater, an aristocratically inclined nation centered around the Chesapeake Bay
  2. Midlands, a moderate, pluralistic, Pacifist nation with Philadelphia Quaker roots.
  3. Yankeedom, a communitarian, utopian-inspired culture founded by New England Puritans.
  4. Appalachian, a libertarian-inclined nation founded by Scots-Irish settlers in the colonial Backcountry.

What’s fascinating is that these borders actually match to different branches in her tree, and I can almost always point to a particular event that had them cross.

First, take a look at Ohio. Woodard has this state split between three nations, Yankeedom, Appalachia and Midlands.

My wife’s maternal line has a lot of Ohio in it. Some Midlands Pennsylvania Dutch, some Appalachian Scots-Irish. For more than a century, both families moved West within these boundaries.

The Scots-Irish line moved from the Virginia backcountry—what is West Virginia today—through Appalachian Kentucky, then Appalachian Southern Ohio, and finally Appalachian Southern Indiana and Illinois.

The Pennsylvania Dutch family went west through Pennsylvania—but only the Midlands Pennsylvania counties—to Midlands Northern Ohio.

How did the two families cross? Well, one branch of the Scots-Irish line ended up in Illinois along what Woodard asserts was the Appalachian/Midlands border, and then went to Midlands Iowa. There, they met up with a branch of that Midlands family that went all over Midlands territory, from Kansas to Iowa to Nebraska.

But beyond that, my wife’s maternal line is only Midlands and Appalachia.

My wife’s paternal line? Until the 1840s, there were really six distinct lines: three were Yankee, one dating back to the Mayflower, and another two that were acculturated in existing Yankee communities in New Brunswick and western New York. The fourth was pure Appalachian. The last two were ethnic German, one 100% Pennsylvania Dutch from colonial times, the other Germans from Russia.

These six lines had absolutely no geographic overlap until the mid 1800s. What brought them together? The Oregon Trail. Between 1843 and 1900, each branch went overland

Classifying genealogy-related images with Azure’s Custom Vision Service

One of my pet peeves on Ancestry.com is getting image hints for things like immigrant ship icons, DNA icons, angels, coats of arms, flags, etc.

I don’t begrudge the folks that want to decorate their tree with these badges, but I don’t care about them and I don’t want to see them as hints.

Now, ancestry.com does have a feature where users can note whether an image they’re uploading is a document or picture or what have you. But it’s not like you can search on it, and I suspect I’m one of the few people that bothers with it.

Thing is, machine learning could identify and categorize all the images we upload, make them searchable, and even exclude some—such as angels and DNA icons—from popping up as hints.

It’s not hard—there are many commercially available machine learning services that will categorize images with just a bit of training. Azure Custom Vision Service actually makes training such a model a drag-and-drop experience.

Let me show you.

First, I need to set up the service. It only takes a minute, assuming you already have an Azure subscription, which I do.

Next, I need to provide the Custom Vision service a bunch of images and tell them what they are.

If I were doing this for real, I’d prepare hundreds of images for far more categories. And, as this little warning notes, I would have equal groups for each category.

But for this demo, I think this is good enough.

Let’s test it!

First, let’s try it out on this DNA icon. See the results here? The model is 95% confident this is DNA, though it also thinks it might be a coat of arms.

This tombstone for Alonzo Hawn? The model is 100% confident.

The same goes for this newspaper article, for this photo of my dad, and the Powell family arms.

Interestingly, this “DNA verified” icon really trips up the model: it thinks this is a coat of arms. If I were really building this model, I’d run another iteration of the model after adding a bunch of images like this one.

Now, ancestry.com would have some more development work to do to make image categorization a feature of their service, but the hard part—the machine learning—isn’t hard anymore.